MongoDB strategies for the disk-averse

Behind the scenes at foursquare, we have a lot of data collection efforts that present interesting scaling puzzles. One is the venue metrics system, which allows business owners to get information about checkins to their venue over time. It lets them see the effect of specials, understand their clientele’s demographics, and even identify their most loyal customers.

To store this data, we need to handle tens of writes per second across millions of venues, interleaved with infrequent reads of the last 90 days of data for a given venue. Ideally, reads would return within a second or two. This isn’t inherently hard, but we really want to minimize the resources dedicated to this small corner of our system.

We’re fans of MongoDB, and one natural way to store this information would be to have one document for every active (venue, hour) pair which contains various counters (men, women, etc.). We would take each new checkin and increment appropriate counters in the (venue, hour) record, creating the record if it doesn’t already exist. But because new records are just appended to the end of a memory-mapped file, every record about a venue would end up being on a different page amid records about other venues that just happened to be created at the same time.
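
A rough sketch of that schema in the mongo shell (the collection and counter names here are just illustrative) would be a single upsert per checkin:

// One document per active (venue, hour) pair; the upsert creates the record
// on the first checkin for that hour and increments counters after that.
db.hourlyStats.update(
  { _id: { v: VENUE_ID, t: HOUR_SINCE_EPOCH } },
  { $inc: { checkins: 1, men: 1 } },
  true)  // upsert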

If we hold all of the data live in RAM, this is no big deal. But as soon as we need to read from disk, every non-contiguous record translates into a page fault, which can be destructive to performance. To load 3 months of data, we'd need to fetch about 2,000 hourly records, and if every record were on a different page and incurred page fault and disk seek overhead, that adds up to tens of seconds. (On our XXL EC2 machine with 4 EBS volumes in RAID0 on ext4, we seem to max out at 50-ish page faults per second, so 2,000 faults works out to roughly 40 seconds.)

Locality, locality, locality

A lot of other database systems provide ways to establish disk locality. Oracle offers Index Organized Tables, MySQL's InnoDB clusters rows by primary key, Tokyo Cabinet has its B-Tree store, and Bigtable stores data in range order. Unfortunately, these all come at the expense of write performance, since data has to be rearranged to make room for new records (with the exception of Bigtable, which uses compactions to mitigate this effect).

MongoDB offers two options that address this, but neither is quite right for our application.

  • MongoDB's collections feature lets us encourage some amount of disk locality. Every page in a database file is allocated to only one collection, so splitting data across many collections would give us the minimum possible page faults to load our data. Collections also reduce index size, since the collection name effectively pulls a field out of the index. But although the number of collections is configurable to be higher than the default of 24,000, there's still a hard limit well below the millions we would need.
  • MongoDB lets us grow an existing document, so that, for example, a given venue's visits in a given month could be a single array that we keep appending to (see the sketch below). Unfortunately, if a record outgrows its available padding, the entire record has to be rewritten to a new location, leading to some very expensive writes and fragmentation on disk. MongoDB decides how much padding to allocate by watching how records grow in a given collection, but if some records grow a lot and others not at all, this is hard to tune. We're experimenting with growing documents for check-in comments, but they simply could not sustain the insertion rate we needed when trying to backfill existing stats data.
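
For illustration, the growing-document pattern (the collection and field names here are hypothetical) looks something like this:

// Keep appending to an array inside a single record. If the array outgrows
// the record's padding, MongoDB has to rewrite the whole document elsewhere.
db.checkinComments.update(
  { _id: CHECKIN_ID },
  { $push: { comments: { u: USER_ID, text: COMMENT_TEXT } } },
  true)  // upsert: create the record on the first comment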

In the end, we settled on building locality at the application level. Every time we need to record the data for a given venue at a given hour, we align it to a five-hour period. We insert a record with a 0’ed out 5-element array (which will fail if that record already exists) and then update into the appropriate position. Roughly:

// TIMESTAMP is seconds since epoch, divided by (5 * 3600).
// It represents the 5-hour bucket starting at that boundary.
// The insert fails (harmlessly) if the bucket already exists.
db.timeSeries.insert(
  { _id: { v: VENUE_ID, t: TIMESTAMP }, vals: [0, 0, 0, 0, 0] })
// Then increment the hour's slot within the bucket (slot 1 shown here).
db.timeSeries.update(
  { _id: { v: VENUE_ID, t: TIMESTAMP } },
  { $inc: { 'vals.1': 1 } })

Then, to query a range of this data:

db.timeSeries.find(
  { _id: { $gte: { v: VENUE_ID, t: TIMESTAMP_1 },
           $lte: { v: VENUE_ID, t: TIMESTAMP_2 } } })
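
The bucket timestamp and the array slot both come straight from the checkin time; a minimal sketch (variable names like checkinSeconds are ours for illustration):

// checkinSeconds is the raw checkin time in seconds since the epoch.
var TIMESTAMP = Math.floor(checkinSeconds / (5 * 3600)); // 5-hour bucket key
var slot = Math.floor(checkinSeconds / 3600) % 5;        // hour offset, 0-4
// Build the dotted field name dynamically so $inc hits the right slot.
var inc = {};
inc['vals.' + slot] = 1;
db.timeSeries.update({ _id: { v: VENUE_ID, t: TIMESTAMP } }, { $inc: inc })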

Grouping every 5 hours of data adds some complexity to our application, but it takes a fifth as many disk seeks as storing each hour separately. How did we pick 5-hour chunks? Based on trial and error, this seemed to be a sweet spot given the sparseness of visits to venues, especially since arrays in BSON are implemented as dictionaries and so aren't quite as compact as traditional arrays.
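
You can get a feel for this overhead in the shell with Object.bsonsize(), which reports the serialized size of a document; the exact numbers depend on the value types, but the per-element index keys and repeated field names show up either way:

// Arrays serialize as documents keyed "0", "1", ..., so each element pays
// for a type byte and its index string on top of the value itself.
Object.bsonsize({ vals: [0, 0, 0, 0, 0] })
// Longer field names are repeated in every record, inflating it further.
Object.bsonsize({ hourlyVisitCounts: [0, 0, 0, 0, 0] })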

Serialization and its disk-contents

There are a few other cool tricks we play in other parts of this data store:

  • Compound keys. As hinted above, MongoDB allows _id's to be objects. This lets us save on the number of indices we need. For example, to record the number of unique visitors in a given hour, we have a collection where all of the data lives in the _id, which is of the form { v: VENUE_ID, t: HOUR_SINCE_EPOCH, u: USER_ID }. The default index on _id already allows us to find all records matching a venue ID, or a venue ID in a given time range (see the sketch after this list). We have to be careful to use ordered dictionaries, provided by the SON class in PyMongo (equivalent features exist in other languages), rather than regular dictionaries: the key order has to be consistent every time we insert or update, and it has to reflect the way in which we intend to query.
  • Covering indices. As of MongoDB 1.7.3, queries that can be answered completely by the index do not hit disk at all to find the document, assuming the index fits in RAM. In the case of the hours-uniques collection described above, the documents are empty and never loaded on queries.
  • Small keys. An oldie-but-goodie: we use small key names. Every key is repeated in every record and can be a significant part of record size if we’re just storing numbers. We keep a mapping from single letters to human-readable names in our application code.
  • Use the right type. Similar to small keys, storing numbers as numbers and object IDs as object IDs can produce significantly smaller data than playing tricks with strings.
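
Putting the compound-key and covering-index ideas together, the hours-uniques pattern looks roughly like this (the collection name is ours for illustration):

// All of the data lives in the compound _id, with short field names, always
// written in the same key order (v, t, u).
db.hourUniques.insert(
  { _id: { v: VENUE_ID, t: HOUR_SINCE_EPOCH, u: USER_ID } })
// Count that venue's unique visitors for the hour. A bound containing only
// {v, t} sorts before any {v, t, u}, so the range picks up every user key
// for that hour while staying entirely within the _id index.
db.hourUniques.find(
  { _id: { $gte: { v: VENUE_ID, t: HOUR_SINCE_EPOCH },
           $lt:  { v: VENUE_ID, t: HOUR_SINCE_EPOCH + 1 } } }).count()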

We also toyed with splitting our data into separate databases by month. This would let us roll off old months of data easily, reducing both active index size and data size on disk. Capped collections could also achieve the same effect. Neither solution felt urgent, and we’re doing just fine keeping hundreds of gigabytes of data live with only 64GB of RAM.

In doing all of this, mongostat and iostat were our friends. Among other things, they helped us realize that covering indices weren’t implemented yet in Mongo 1.6 and that the last point release of EC2’s official Red Hat AMI actually could not address more than 32 GB of physical RAM.

Disk-saster averted!

One of the best things about MongoDB is that it's very easy to reason about, which makes it straightforward to design performant schemas rather than fiddle endlessly with confusing tuning parameters. Amazon's EBS has abysmal I/O performance, even with RAID0 and ext4. But by understanding a little bit more about how MongoDB pages data off of disk, it's still possible to squeeze out something reasonable.

Kushal Dave, foursquare engineer