Stability in the midst of chaos

I gave a talk at MongoNYC (video here!) about how we’ve built resiliency into our Mongo clusters on top of a volatile hardware environment.

We currently run all of our hardware infrastructure on Amazon's EC2 platform. EC2 has allowed us to be really flexible as foursquare has grown in usage and features, but it comes with the trade-off of limited hardware options. The most important piece of hardware for databases is the disk, and the disk options on EC2 are limited to local drives on the same machine as the hypervisor, or Elastic Block Store (EBS), which is network-attached disk. We found that our IO requirements could not be met by the local machine storage, so we run our Mongo instances on RAIDed EBS volumes. The steady-state performance of EBS is pretty good. However, as a network service, its latency and bandwidth can vary. In some situations, IO operations can be completely blocked for tens of seconds at a time.
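
To give a sense of what a stall looks like, here's a minimal sketch of the kind of check you could run against the data volume: time a small write plus fsync and see how long it takes. The mount point, probe file, and payload size are made-up placeholders for illustration, not the exact tooling from the talk.

```python
import os
import time

# Hypothetical mount point for the RAIDed EBS volume backing mongod.
DATA_DIR = "/data/mongodb"
PROBE_FILE = os.path.join(DATA_DIR, ".disk_probe")


def probe_write_latency(payload_size=4096):
    """Time a small write followed by fsync on the data volume.

    On healthy EBS this should take a few milliseconds; during a
    network-disk stall it can block for tens of seconds.
    """
    payload = b"\0" * payload_size
    start = time.time()
    fd = os.open(PROBE_FILE, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write through to the volume, not just the page cache
    finally:
        os.close(fd)
    return time.time() - start


if __name__ == "__main__":
    print("write+fsync latency: %.3fs" % probe_write_latency())
```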

Mongo has built-in failover across nodes in a replica set, and it handles outright machine or process failure well, but it does not fail over when the disk has degraded performance. For a long time we reacted to those situations manually, but as the number of servers has grown, the rate of failure has increased and a manual response is no longer feasible. This presentation outlines the steps we took to build custom tools that detect degraded disk conditions and the modifications we made to Mongo to automate the failover.
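
As a rough sketch of the detection-plus-failover idea (not the actual Mongo modifications described in the talk), an external watchdog could reuse a probe like the one above and ask the primary to step down after a few consecutive slow samples. The thresholds, pymongo wiring, and loop cadence here are all illustrative assumptions.

```python
import time

from pymongo import MongoClient
from pymongo.errors import AutoReconnect

# Illustrative thresholds: three consecutive slow probes trigger a step-down.
LATENCY_THRESHOLD_SECS = 5.0
CONSECUTIVE_FAILURES = 3


def step_down_primary(client, step_down_secs=60):
    """Ask the local mongod to relinquish primary for step_down_secs."""
    try:
        client.admin.command("replSetStepDown", step_down_secs)
    except AutoReconnect:
        # The stepping-down primary closes client connections, so this
        # network error is expected and treated as success.
        pass


def watchdog(probe, client, interval_secs=10):
    """Run the disk probe in a loop and fail over on sustained stalls."""
    slow_count = 0
    while True:
        latency = probe()
        slow_count = slow_count + 1 if latency > LATENCY_THRESHOLD_SECS else 0
        if slow_count >= CONSECUTIVE_FAILURES:
            step_down_primary(client)
            slow_count = 0
        time.sleep(interval_secs)


if __name__ == "__main__":
    # Reuses probe_write_latency from the sketch above.
    watchdog(probe_write_latency, MongoClient("localhost", 27017))
```

Running the check outside mongod is the simpler variant to sketch; the presentation covers the changes made to Mongo itself so that the replica set reacts to degraded disks on its own.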

Mongo at foursquare: Stability in the midst of chaos

- Jon Hoffman (@hoffrocket)