Fun with MongoDB replica sets

Though we’ve already established how much we like MongoDB, we felt it appropriate to keep the love train going with another great feature: MongoDB replica sets and how we’re using them at foursquare.

Replica sets are super useful because they provide automated replication and failover, even in sharded clusters. Previously, our clusters ran master/slave replication, which worked well but required a lot of manual work whenever we needed to fail over to a slave. It also required us to hack in our own solutions to make the slaves available for read-only queries. Replica sets directly address both of these problems.

Replica sets have been available since MongoDB 1.6, though some more interesting refinements were rolled out in 1.8, which made our transition to replica sets go a little more smoothly. What’s nice about them is their simplicity. As you’ll see, setting up automated failover for a sharded database cluster is as simple as starting up mongo server instances and entering some commands in the mongo shell. Even the migration path from master/slave replication is pretty straightforward.

We won’t go too much into how replica sets work. Instead, we’ll describe how we set up a few of our replica set configurations and the motivations behind them.

1. Straight-up replica set with arbiter node

We have a couple of single-master/single-slave clusters that we converted to a replica set with a primary and secondary node plus an arbiter node. This is probably the least complex replica set configuration in our database setup. To do the conversion, we updated all the mongod configuration files with the following options:

replSet = auxdb
fastsync = true
rest = true

The fastsync option allows a node to come online quickly if its data is reasonably up to date with the primary node (i.e., the primary’s oplog is large enough to have buffered all writes that occurred while the node was being restarted). The rest option enables a really handy admin UI for diagnostics, and it exposes stats over HTTP, which is useful if you’re running an external monitoring tool like Ganglia or Nagios.
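Whether fastsync is safe comes down to how much downtime the oplog can absorb. Here’s a rough back-of-the-envelope sketch (the numbers are made up for illustration, not our production figures):

```python
def oplog_window_hours(oplog_size_gb, write_rate_mb_per_hour):
    """Estimate how long a restarted node can lag before its gap
    falls off the end of the primary's oplog."""
    return (oplog_size_gb * 1024.0) / write_rate_mb_per_hour

# e.g. a 50 GB oplog absorbing 2 GB of writes per hour
# buffers roughly a day of downtime
print(oplog_window_hours(50, 2048))  # → 25.0
```

If the node has been down longer than this window, fastsync won’t help and a full resync is needed.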

We also added an EC2 instance dedicated to running mongods as arbiters. These mongod processes are lightweight and participate only in elections. Though running a single EC2 instance for this purpose introduces a single point of failure, we felt that having to maintain just one node was a reasonable trade-off for our less critical clusters.
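The reason the arbiter earns its keep is that a primary can only be elected (or remain primary) when a strict majority of voting members is reachable. A toy illustration of that arithmetic (this is just the majority rule, not MongoDB’s actual election code):

```python
def can_elect_primary(voting_members, reachable):
    """A strict majority of all voting members must be reachable
    for the set to elect (or keep) a primary."""
    return reachable > voting_members // 2

# two data nodes alone: a partition leaves 1 of 2 reachable -- no majority
print(can_elect_primary(2, 1))  # → False
# add an arbiter: the surviving node plus the arbiter is 2 of 3
print(can_elect_primary(3, 2))  # → True
```

With only two data nodes, losing either one would leave the set unable to elect a primary; the arbiter breaks the tie without storing any data.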

After restarting the master mongod with the new configuration options, we executed the following commands in the mongo shell:

connecting to: auxdb-0:27017/admin
> rs.initiate() // initialize with a fresh replica set config
{
"info" : "Config now saved locally.  Should come online in about a minute.",
"ok" : 1
}
auxdb:PRIMARY> rs.add("auxdb-1:27017") // add another node
{ "ok" : 1 }
auxdb:PRIMARY> rs.addArb("auxdb-arbiter:27017") // add the arbiter node
{ "ok" : 1 }

At this point, we update our app code to pass the host/port list of the replica set members to the MongoDB driver, which uses it to connect to the set. Note that we spell out port 27017 even though MongoDB assumes it by default; including it keeps the config unambiguous. Should a failure occur, the driver automatically figures out which node in the set is the primary. Magic!
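In driver terms, this usually means handing over a seed list. A minimal sketch of building a standard MongoDB connection URI from the hosts in our example (the helper function itself is hypothetical; any reachable member in the seed list lets the driver discover the rest of the set):

```python
def replica_set_uri(hosts, replica_set):
    """Build a mongodb:// connection URI from a seed list of
    replica set members and the set's name."""
    return "mongodb://%s/?replicaSet=%s" % (",".join(hosts), replica_set)

print(replica_set_uri(["auxdb-0:27017", "auxdb-1:27017"], "auxdb"))
# → mongodb://auxdb-0:27017,auxdb-1:27017/?replicaSet=auxdb
```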

2. Replica set with dedicated read slaves and single backup node

We have a few clusters that are particularly read-heavy. Replica sets give us read-scaling with the "slaveOk" option, which we can set on a per-connection basis with "mongo.slaveOk()". Secondary nodes receive read queries, but all writes still go to the primary node. In addition, we keep a single backup node and prevent it from receiving any read queries by configuring it as a hidden node. We do this so that new nodes can perform an initialSync from it without impacting production traffic.

Again, we update the configuration files, restart the mongods, and connect to the cluster with the mongo shell:

connecting to: db-0:27017/admin
> rs.initiate()
{
"info" : "Config now saved locally.  Should come online in about a minute.",
"ok" : 1
}
cluster2:PRIMARY> rs.add("db-1:27017")
{ "ok" : 1 }
cluster2:PRIMARY> rs.add("db-2:27017")
{ "ok" : 1 }
cluster2:PRIMARY> rs.add("db-3:27017")
{ "ok" : 1 }
cluster2:PRIMARY> rs.add({ "_id" : 4, "host" : "db-backup:27017", "priority" : 0, "hidden" : true })
{ "ok" : 1 }

db-backup:27017 is like a regular secondary node that receives updates from the primary node, but because it’s hidden it doesn’t receive queries and is ineligible to become primary should an election be triggered. New nodes that we bring online are configured like this:

cluster2:PRIMARY> rs.add({ "_id" : 5, "host" : "db-4:27017", "initialSync" : { "name" : "db-backup:27017" } })
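Clients never see the hidden backup member because drivers filter it out of the member list they discover. A rough sketch of that filtering, applied to a member list like ours (simplified; real drivers work from the isMaster/replica set config responses):

```python
def readable_members(members):
    """Return hosts a driver may route reads to: data-bearing,
    non-hidden members. Hidden members and arbiters are skipped."""
    return [m["host"] for m in members
            if not m.get("hidden") and not m.get("arbiterOnly")]

members = [
    {"host": "db-0:27017"},
    {"host": "db-1:27017"},
    {"host": "db-backup:27017", "priority": 0, "hidden": True},
]
print(readable_members(members))  # → ['db-0:27017', 'db-1:27017']
```

This is why the backup node can serve as an initialSync source for new members without ever taking production read traffic.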

3. Shard on top of replica set with dedicated read slaves

This is like the previously described setup except we back a shard with a replica set, and mongos is configured with the replica set host/port information. This setup combines some of MongoDB’s most powerful features in a fashion similar to how drives are organized in a RAID 1+0 array. We find that this configuration simplifies our operational management for our larger clusters and allows us to scale out horizontally with ease.

We had already created a sharded database configuration and set up mongos and mongod config servers. In addition, each shard was already running a master/slave setup. To convert these clusters to use replica sets, we just went through the steps in number 2 above for each node in the shard and reconfigured the mongos instance:

connecting to: mongos-shard-0:30000/admin
> db.shards.update({ _id : "shard0000" }, {"$set" : { "host" : "sharddb/shard-0-a:27017,shard-1-a:27017,shard-2-a:27017,shard-3-a:27017,shard-backup-a:27017"} })

One nice thing about this is that our app driver config doesn’t need to change. mongos is smart enough not only to determine which node in the set is the primary, but also to find other nodes in the set that weren’t added to the shard configuration initially. We also set slaveOk() in our connection setup and fan out our reads across the nodes in the shard.
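The fan-out itself amounts to picking among the readable members for each query. A toy round-robin chooser shows the idea (illustrative only; real drivers and mongos also weigh ping times and member health):

```python
import itertools

def reader(secondary_hosts):
    """Return a function that cycles reads across secondaries;
    writes still go to the primary regardless."""
    ring = itertools.cycle(secondary_hosts)
    return lambda: next(ring)

next_read_host = reader(["db-1:27017", "db-2:27017", "db-3:27017"])
print(next_read_host())  # → db-1:27017
print(next_read_host())  # → db-2:27017
```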

That’s it!

If this all looks simple, it’s because it is. We’ve been very pleased with how replica sets have smoothed out the bumps, both in handling operational emergencies and in general maintenance and scaling out. Our next step is to configure our replica sets for data center awareness, which is not fully supported but workable with the latest version of MongoDB. Yay Mongo!

- Leo Kim, foursquare engineer