Gson Gotchas on Android

This is Part 2 in our 2 part series on latency. In Part 1, we discuss some techniques we use for measuring latency at Foursquare. Here we’ll discuss some specific Gson related changes we made to improve performance in our Android apps.

Shortly after the launch of Foursquare for Android 8.0, we found that our app was not as performant as we wanted it to be. Scrolling through our homepage was not buttery smooth, but quite janky with frequent GC_FOR_ALLOC calls in logcat. Switching between activities and fragments was not as quick as it should be. We investigated and profiled, and one of the largest items that jumped out was the amount of time spent parsing JSON. In many situations, this turned out to be multiple seconds even on relatively modern hardware such as the Nexus 4, which is crazy. We decided to dig in and do an audit of our JSON parsing code to find out why.

In Foursquare and Swarm for Android, practically all interaction with the server is done through a JSON API. We use Google’s Gson library extensively to deserialize JSON strings into Java objects that we as Android developers like working with. Here’s a simple example that converts a string representation of a venue into a Java object:

This works, but we don’t actually need the whole JSON string to begin parsing. Fortunately, Gson has a streaming API that let’s us parse a JSON stream one token at a time. Here’s what a simple example of that would look like:

So we did this, but still didn’t see any significant speed up or smoother app performance. What was going on? It turns out that we were shooting ourselves in the foot with our usage of custom Gson deserializers. We use custom deserializers because there are times when we don’t want a strict 1:1 mapping between JSON and Java objects they deserialize to. Gson allows for this, and provides the JsonDeserializer interface to facilitate this:

The way you use this is you implement this interface and tell it what type you want it to watch out for. You then register this with the Gson instance you are using to deserialize, and from then on whenever you try to deserialize some JSON to a certain type Type typeOfT, Gson will check to see if a custom deserializer is set up to handle that type and if so, will call that custom deserializer’s deserialize method. We use this for a few types, one of which happens to be our outermost Response type that encapsulates all Foursquare API responses:

The problem here is that despite us thinking we were using Gson’s streaming API, our usage of custom deserializers would cause whatever JSON stream we were trying to deserialize to be completely read up into a JsonElement object tree by Gson to be passed to that deserialize method (the very thing we were trying to avoid!). To make matters worse, doing this on our outermost response type that wraps every single response we receive from the server prevents any kind of streaming deserialization from ever happening. It turns out that TypeAdapters and TypeAdapterFactorys are what are now preferred and recommended over JsonDeserializer. Their class definitions look roughly like this:

Note the JsonReader stream being passed to the read() method as opposed to the JsonElement tree. After being enlightened with this information, we updated our custom deserializers to extend TypeAdapters and TypeAdapterFactorys and noticed significant parse time decreases of up to 50% for large responses. More importantly, the app felt significantly faster. Scroll performance that was previously janky from constant GCs due to memory pressure was noticeably smoother.


  • Use GSON’s streaming APIs, especially in memory-constrained environments like Android. The memory savings for non-trivial JSON strings are significant.
  • Deserializers written using TypeAdapters are generally uglier than those written with JsonDeserializers due to the lower level nature of working with stream tokens.
  • Deserializers written using TypeAdapters may be less flexible than those written with JsonDeserializers. Imagine you want a type field to determine what an object field deserializes to. With the streaming API, you need to guarantee that type comes down in the response before object.
  • Despite its drawbacks, use TypeAdapters over JsonDeserializers as the Gson docs instruct. The memory savings are usually worth it.
  • But in general, avoid custom deserialization if at all possible, as it adds complexity

Interested in these kinds of problems? Come join us!

Matthew Michihara

Measuring user perceived latency

At Foursquare, tracking and improving server-side response times is a problem many engineers are familiar with. We collect a myriad of server-side timing metrics in Graphite and have automated alerts if server endpoints respond too slowly. However, one critical metric that can be a bit harder to measure for any mobile application is user perceived latency. How long did the user feel like they waited for the application to startup or the next screen to load after they’ve tapped a button? Steve Souder gives a great talk about the perception of latency in this short talk.

For a mobile application like Foursquare, user perceived latency is composed of several factors. In a typical flow, the client makes an HTTPS request to a server, the server generates the response, the client receives a response, parses the response and then renders it. Client Timing Diagram

We instrumented Foursquare clients to report basic timing metrics in an effort to understand user perceived latency for the home screen. Periodically, the client batches and reports these measured intervals to a server endpoint which then logs the data into Kafka. For example, one metric the client reports is the delta between when the client initiated a request and when the first byte of the response is received. Another metric the client reports is simply how long the JSON parsing of the response took. On the server-side, we also have Kafka logs of how long the server spent generating a response. By combining client-side measured timings with server-side measured timings using Hive, we are able to sketch a rough timeline of user perceived latency with three key components: Network transit, server-side time, and client-side parsing and rendering. Note that there are many additional complexities within these components, however this simple breakdown can be a useful starting point for further investigation. .

The above bar chart shows a composite request timeline that is built using the median timing of each component from a sample of 14k Foursquare iPhone home screen requests. In this example, the user might wait nearly two seconds before the screen is rendered, and most of it was actually due to network and client time rather than server response time. Let’s dive into network and client time deeper.

Network time

The next chart below splits out requests in Brazil versus requests in the US.

The state of wireless networks and the latency to datacenter are major factors in network transit time. In the above comparison, the median Brazil request takes twice as long as one in the US. At Foursquare, all API traffic goes through SSL, to protect user data. SSL is generally fine for a connection that has already been opened, but the initial handshake can be quite costly as it typically requires two round-trips additional to a typical HTTP connection. It’s absolutely critical for a client to reuse the same SSL connection between requests, or this penalty will be paid each time. Working with a CDN to provide SSL early termination can also be incredibly beneficial at reducing the cost of your first request (often the most important one, since the user is definitely waiting for it to finish). For most connections, the transmission time is going to dominate, especially on non-LTE networks. To reduce the number of packets sent over the wire, we eliminated unnecessary or duplicated information in the markup and we were able to cut our payload by more than 30%. It turns out, however, that reducing the amount of JSON markup also had a big impact on the time spent in the client.

Client time

The amount of time spent processing the request on the client is non-trivial can vary wildly depending on the hardware. The difference in client time in the US vs Brazil chart is likely due to the different mix of hardware devices in wide use in the market. For example, if we were to plot the median JSON parsing times across different iPhone hardware, we would see a massive difference from older iPhone 4’s to the latest iPhone 6’s. Although not as many users are on the older hardware, it’s important to understand just how much impact needless JSON markup can have.

In addition to JSON processing, another important topic for iOS devices is Core Data serialization. In our internal metrics, we found that serializing data into Core Data can be quite time consuming and is similarly more expensive for older hardware models. In the future, we are looking at ways to avoid unnecessary Core Data access.

A similar variation can be found across Android hardware as well. The chart below shows the median JSON parsing times of various Samsung devices, (note that the Android timing is not directly comparable to the iPhone timing, as the Android metric is measuring the parsing of the JSON markup to custom data structures while the iPhone measurement is parsing straight to simple dictionaries). android_parse

In our next engineering blog post, we will discuss some critical fixes that were made in Android JSON parsing.


Measurement is an important first step towards improving user perceived latency. As Amdahl’s law prescribes, making improvements on the largest components of user perceived latency will of course have the largest user impact. In our case, measurements pointed at taking a closer look at networking considerations and client processing time.

– Ben Lee (@benlee) & Daniel Salinas (@zzorba42)

Geographic Taste Uniqueness

Last August we launched Tastes to help our users customize their local search experience. Taste tags like “trendy place”, “pork buns”, or “romantic restaurant” not only help users find the kinds of places they like when out and about, but also allow us to answer, for the first time, the question of “What is this area known for?”.

Taste data is a 2-way street. Not only are our users making use of tastes to personalize their experiences within the app, but every venue that we have external and user generated content for has it’s own unique taste profile as well. Making use of many input sources we are able to reliably attach tastes to the venues within the Foursquare venue database and calculate how strongly affiliated each of the applied tastes are with a given venue with a single affinity score. Applying our NLP stack to analyze user tips at a venue, we are able to distill that data into several metrics and scores (i.e. sentiment score, quality score, spam-like measure, etc.) that feed directly into the affinity score. Additionally, explicit data from users in the form of ‘Rate Places’ votes that signal which tastes our users liked at a venue is also incorporated into that final score.

Once tastes and their affinity scores are applied to our venues we can dig into our data science tool chest and use Old Faithful, TF-IDF, to find the tastes that are most unique in a given geographic region. TF-IDF is typically used to measure the importance of a term within a particular document that belongs to a larger corpus of documents. However, for our geographic taste measurement scores we have to modify the traditional understanding of what terms, documents, and corpora mean. Given the task of trying to identify the most important tastes of a sub-region in comparison to the region as a whole, we treat each venue as a single document, the tastes that are attached to the venues as the terms, and the affinity for a specified taste as the term frequency. Finally, we aggregate the venues, v, by the sub-region R we wish to measure and apply the following customized formula to find the taste uniqueness of taste t in R:

GeoTaste TFIDF

This formula is applied to every taste for a specified sub-region, producing a ranking of how unique every taste is to that sub-region. We then took the top 50 tastes ranked by uniqueness and resorted based on their affinity to the sub-region in order to find the most frequently seen tastes among the most unique. Every week, as part of our Hadoop data processing pipeline, we calculate these scores on various pairings of region and sub-region (US vs US state, US vs US city, city vs neighborhood) and used the final rankings to produce the “top” tastes in each sub-region.

The tables below represent a sampling of the results from this work.

Screen Shot 2015-03-05 at 6.18.07 PM

When we first generated this data, we immediately knew it would make a great feature in the Foursquare app. With a few changes to our search pipeline, we were able to surface them as quick links for users visiting these neighborhoods:  


Chinatown, NY, NY


Mission District, San Francisco, CA


We’ve just scratched the surface of digging into this data. If tackling these kinds of data analysis problems and working with an amazing dataset (and incredible co-workers) interests you, come join us!

– Will Bertelsen (@wbertelsen) & Kris Concepcion (@kjc9)

Announcing the first Foursquare API Demo Day!

Every couple of weeks we have an internal demo day – an hour where people demo things they’ve been working on to the rest of the company. Demos can be anything from a prototype app feature to a cool data visualization to a command-line tool. It’s a fun way to get inspired and to find out what co-workers are up to.

Now we’re inviting you to share your creativity in the same way, by showcasing a mobile, web or wearables app at the first ever public Foursquare API Demo Day, on November 12th at our New York headquarters.

The idea is simple: You build a prototype that uses the Foursquare API, and then demo it to Foursquare CEO and co-founder Dennis Crowley, Foursquare executives, product managers and engineers, and of course other participants. Your demo doesn’t need to be polished, it just needs to sort-of work. Some of our hackiest demos have turned into major Foursquare and Swarm features.

You can see more details, and sign up, here:

This isn’t a hackathon: there’s no competition, no artificial time constraint, and no caffeine-fueled all-nighters. You can build your demo on your own time, using any existing code. You can work in a team or solo. And if you have questions about our API as you go, we’ll be happy to answer them.

Coding isn’t a competitive sport: At this event all participants get an equal opportunity to wow and be wowed, and to make connections with Foursquare engineers and PMs and with other participants. And, who knows? Maybe you’ll have the opportunity to take your idea to the next level and put it in the hands of millions of Foursquare users.

Got a great idea on how to use Foursquare data? We want to see it! Sign up now and we’ll see you, and your demo, on November 12th!

Exploring the Foursquare ‘Taste Map’

In order to deliver great personalized local recommendations, Foursquare needs to understand not only which places are the best, but also what makes places all over the world different from each other. Whether it’s a dive bar with great late night dancing or a molecular gastronomy restaurant with an amazing tasting menu, we want to categorize these places and understand the relationship that Foursquare users have with them. That’s why this summer we launched “Tastes,” short descriptive tags for venues to help users personalize their experience and find places that suit them. Tastes can be as simple as a favorite dish like “soup dumplings” or a vibe like “good for dates”.

To better understand what our taste data looks like, I created the “Foursquare Taste Map.” Here we see a visualization of the most popular three thousand English tastes. Each taste is connected with a line to others like it, and they are arranged so that similar tastes are closer together. For more technical folks, this is a spring embedding of the k-nearest neighbor graph of tastes using the cosine similarity metric (plotted in Gephi), where each taste is represented as a high-dimensional vector of venue affinities.


Taste Map (thumbnail)

Obviously it’s difficult to capture all of the relationships between these tastes on a single page, but you can still see amazing structure emerge like “wine island” on the far right, or various niches of Asian cuisine in the lower left hand corner, or a variety of different hubs emerge around common dishes like “seafood”, “chicken”, and “pizza.” We are so excited to have the opportunity to work with this unique data set to better understand all of the places in the world, and thought you’d enjoy this visualization.

Asian Cuisine Close-Up

The Foursquare Taste Map
(don’t forget to zoom in and scroll around, and add your favorite tastes to your Foursquare account)

Happy exploring!

Blake (@metablake)

Introducing Pants: a build system for large-scale codebases like Foursquare’s.

Foursquare and Swarm are written predominantly in Scala on the server side. But as we’ve grown, so have the size, complexity and diversity of our codebase:

  • We currently have around 700,000 lines of handwritten code in 6500 .scala files.
  • We generate about 1.9 million lines of Scala code from 1400 .thrift files using Spindle, our homegrown data model code generator.
  • We generate UI code from Closure Templates.
  • We compile CSS using Less.
  • We have a significant amount of Python code for running our deploy toolchain, data processing pipelines and other offline systems.
  • Like all large codebases, we also have little bits of other things here and there: some C, some Java, some Ruby, some Bash scripts, and so on.

Naturally there is a complex web of dependencies between different parts of the codebase. In fact our code dependency graph has about 2500 vertices and tens of thousands of edges.

We needed a build toolchain that would work well with this complexity, and the result is Pants, an open source build system we developed together with engineers from Twitter, Square and elsewhere.

Pants was designed to support large-scale, heterogeneous codebases. It uses fine-grained dependency management to build only what you need, keeping build times from getting unnecessarily long (a must when using Scala, with its slow compilation speeds). Pants also makes it straightforward to plug in custom build tools, such as code generators, and it supports every part of the build lifecycle: codegen, external dependency resolution, compilation, linting, test running, bundling deployable artifacts and more.

You can read more about Pants here, including the etymology of the name. If your codebase is growing beyond your toolchain’s ability to scale, you might want to give Pants a try. And of course we’re always looking for contributors to the project!

What today’s announcement means for developers

Update (August 2014): For the latest on our plans for the API, be sure to read our entire recent update for developers.

Today we announced some big news about the products we work on every day here at Foursquare. Read about it if you haven’t already—it’s some pretty exciting stuff. But what do these changes mean for the thousands of developers that rely on our public API?

If you are a developer, in the short run, this will have no effect on your apps. Your users will continue to be able to connect their Foursquare accounts to your app and the way to access the API remains unchanged. In the long run, look for a few coming changes but even more exciting features being added to the API as Swarm and Foursquare grow and evolve. As always, we plan on giving plenty of advanced notice before we make any major changes to the API.

While we have your attention though, take this opportunity to do some spring housecleaning for your Foursquare app! Make sure you follow @foursquareAPI and that the email address associated with your Foursquare app is up-to-date—we use both for making important announcements. And finally, are you ready for our upcoming versioning changes?

As always, you have these channels for questions and support: email, Twitter, StackOverflow, and (new!) AirPair.

Looking forward to Mongo 2.6: A deep dive into the new write commands

We’ve been longtime Mongo users at Foursquare and each new version has brought enhancements in performance, reliability, and features. Mongo 2.6 has a bunch of new features that we’re excited about, but I’m going to focus on just one which might be easy to gloss over if you’re looking at the release notes.

Write operations will be sent synchronously over the wire allowing for some simplifications to mongo internals that should yield performance benefits.


Up until 2.6 all write operations were sent asynchronously over the wire. They were completely fire and forget. When a write command packet was sent to Mongo over a TCP connection, no bytes were returned by the server. In order for a client to get feedback on success, a “getLastError” command had to be sent separately on the same TCP connection with the developer’s durability preferences. The durability preference is referred to as the WriteConcern and might be something like “replicate to two servers” or “fsync to one server”. All the major mongo client drivers abstracted that async behavior so that it looked synchronous, but there were some real negative consequences.

Because the “getLastError” command could be sent at some future point after a write operation, clients had to have their TCP connections pinned for the duration of their sessions. That meant that there were constraints on connection pooling within the drivers and within the mongoS sharded router. Additionally, both the mongoS router and the mongoD servers had to keep track of the accounting of failed writes in order to match up future requests for feedback. Everything could be much simpler if write operations just blocked the client and returned their results.

With mongo 2.2 we hit some problems related to connection growth on our primary servers and mongodb engineers created a patch to allow for better connection reuse on the mongoS []. That was a bit of a hack but it has worked very well for our use case.

New write commands

In 2.6, write operations are now sent using the existing command infrastructure that query, count, getlasterror, and all the operational commands utilize. Unlike the old fire and forget write operations, the new command based write operations send a reply for every request. Each type of write command also supports batching for more efficient network utilization and reliability. In the case of any failures within a batch, the reply will contain error messages for each of the operations that failed.

With 2.6, we hope to see further performance and scalability enhancements in our sharded clusters. The mongoS will do a much better job connection pooling because it will be able to reuse connections since they will no longer need to be pinned to each client connection. That should cause the connection counts to the mongoD servers to drop and allow us to add even more clients in the future without worrying about running into limits.

How it works

The drivers will continue to abstract things in a similar way, but for those interested on how the new write operation commands actually work, read further:

There are just a few different types of packets that can be sent over the wire. All queries and commands are sent using an OP_QUERY packet. OP_INSERT, OP_UPDATE, and OP_DELETE are the way to do write operations prior to 2.6 and the big change is that writes are now sent as commands using the OP_QUERY packet.

Commands and queries are both sent using the same packet over the wire and a simple convention is used to tell the server that the packet is a command: the “fullCollectionName” will be of the form “dbname.$cmd” and the “query” will be a BSON document with the command name and parameters.

For a new 2.6 update command, the document will look something like this:

 "update": "collectionName",
 "updates": [
     "q": {...}, // the query
     "u": {...}, // the update
     "multi": false,
     "upsert": false
 "ordered": true, // should errors stop the batch, true by default
 "writeConcern": {"w": 1}

Notice that the “updates” key is actually an array. That’s used to support batching of writes. A single update will simply be a batch of one. The “ordered” flag determines whether an error in one item in the batch will terminate further processing or if it will continue on despite the errors. If the flag is set to true, each operation will be executed sequentially, otherwise mongo can split the batch up and execute things out of order in the case where the operation affects multiple shards of a cluster.

As I mentioned above, the command is synchronous and the reply might look like this in the case of an upsert:

 "ok" : 1,
 "nModified" : 0,
 "n" : 1,
 "upserted" : [{
   "index" : 0, // matches the index of the update batch
   "_id" : ObjectId("532f9406fec9bb9b1bee9290")

If there was an error with one or more of the operations, the response might look like this in the case of an insert:

 "ok" : 1,
 "n" : 0,
 "writeErrors" : [{
   "index" : 0, // matches the index of the insert batch
   "code" : 11000,
   "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.c.$_id_ dup key: { : ObjectId('532f941bfec9bb9b1bee9291') }"

If I had batched multiple inserts and the first failed but I had “ordered” set to false, the rest would be attempted. Otherwise the command would stop processing.

I hope this has been interesting for you. Keep in mind that the performance benefits that we’re excited about are completely theoretical until tested out in production. We aren’t using 2.6 in production yet, but plan on rolling it out slowly in the coming months. If you’re going to try 2.6 for yourself, you’ll need to upgrade your client driver to take advantage of the new command infrastructure.

Jon Hoffman
Foursquare Engineer

Good Tech Lead, Bad Tech Lead

A brief guide to tech leadership at Foursquare, inspired by Ben Horowitz’s Good Product Manager, Bad Product Manager.


Good tech leads act as a member of the team, and consider themselves successful when the team is successful. They take the unsexy grungy work and clear roadblocks so their team can operate at 100%. They work to broaden the technical capabilities of their team, making sure knowledge of critical systems is not concentrated in one or two minds.

Bad tech leads take the high-profile tasks for themselves and are motivated by being able to take credit for doing the work. They optimize locally, keeping team members working on projects that benefit the team at the expense of the engineering organization at large.

Technical vision

Good tech leads have an overall vision for the technical direction of the product and make sure the team understands it. They delegate feature areas to other team members and let them own their decisions. They recognize that their team members are smart, trust them, and rely on them to handle significant pieces of the project.

Bad tech leads resist explaining or clarifying the technical direction and dictate decisions instead. They keep critical institutional knowledge in their heads, failing to multiply their effectiveness by creating and disseminating helpful documentation.

Discussion and debate

Good tech leads listen and encourage debate. When the team is unable to resolve a debate, they describe a process or framework of thinking that would help them resolve it. They don’t enter discussions with foregone conclusions, and always allow themselves to be persuaded by great ideas.

Bad tech leads allow debates to go on for too long without resolution, hampering the productivity of the team. Others cut off debate prematurely, dismissing new discussions by saying the matter is “already settled.” Bad tech leads believe it is more important that they win the argument than that the team reaches the right decision.

Amazing comic courtesy @blackmad

Project management

Good tech leads are proactive. They make sure technical progress is on track. They work with team members to come up with estimates and to establish intermediate milestones. They anticipate areas of concern and make sure they are addressed before they become a problem. They identify technical roadblocks and help the team get around them. They identify areas of overlap where work can be shared, and conversely, find areas that are not getting enough attention and direct resources toward it.

Bad tech leads are reactive. They may delegate, but do not follow up to make sure progress is being made. They don’t set intermediate goals and hope that everything just comes together in the end. They wait until just before launch to do end-to-end tests of complex systems. They allow team members to waste time on interesting but unimportant work.


Good tech leads are pragmatic and find a balance between doing it right and getting it done. They cut corners when it’s expedient but never out of laziness. They encourage their team to find temporary shortcuts or workarounds to problems that are blocking overall progress, and to build minimum viable infrastructure for launch. To good tech leads, details matter. Code quality, code reviews, and testing are just as important as shipping on time.

Bad tech leads take shortcuts that save time in the short term but cost more in the long term, and let technical debt pile up. They cannot distinguish between situations that call for expediency and those that call for perfection.


Good tech leads know that their role is much more than writing code, that effective communication is a vital part of their job, and that time spent making their team more efficient is time well spent. They acknowledge that some communication overhead is necessary when working on a team, and they sacrifice some personal productivity for overall team productivity.

Bad tech leads believe that they are most productive when they are writing code, and think communication is a distraction. They do not optimize for overall team productivity, but rather for what works best for themselves. They get frustrated when they have to take time to lead.

Relationship with Product

Good tech leads are in a conversation with product managers and designers about how the product should work. They are not afraid to push back on decisions they disagree with, but keep the product goals in mind and know when to accommodate them. They find creative workarounds to technical constraints by suggesting alternative product formulations that are less technically demanding, and help PMs and designers understand technical challenges so that they make informed trade-offs themselves.

Bad tech leads throw product decisions “over the wall” and do not take ownership of the product. They push back due to technical constraints but do not offer alternatives or explanations.


Good tech leads are resilient to changes to the product specification and react calmly to surprises. They anticipate where changes might take place and design their code to handle them.

Bad tech leads are upset when the specification changes, or prematurely generalize their design in areas where changes are unlikely to occur.


Good tech leads are easy-going but assertive. Bad tech leads are confrontational and aggressive. Good tech leads emerge naturally and earn respect through technical competence and experience. Bad tech leads think their title confers respect and authority. Good tech leads are always looking for ways to improve.

Bad tech leads get defensive when given feedback. Good tech leads are humble and boost the confidence of everyone else on the team. Bad tech leads are arrogant and take pleasure in making their teammates feel inferior.

-Jason Liszka (originally published on Medium)

Foursquare is hiring!

Mongo on Hadoop

At Foursquare, one of our most important pieces of data infrastructure is getting a copy of our production Mongo database into Hadoop. Today, we’re open-sourcing two parts of this job, a utility to dump Mongo to Hadoop, and code to read this data from MapReduce jobs.

Why Mongo on Hadoop?

Our Mongo database is the source of truth for everything at Foursquare. How many users do we have? How many people are checked into JFK right now? What’s the most-liked tip left by a friend at this bar? Ask Mongo. However, the database has to stay extremely responsive to reads and writes at all times; no one wants their check ins to show up ten minutes after they happen! This means that more expensive queries (such as looking at every user joined with every venue they’ve visited) cannot happen on Mongo. So how do we answer these questions? We dump the data to Hadoop!

What is Hadoop?

Hadoop is an open-source computation framework for massive sets of data. It has two main components: HDFS (Hadoop Distributed File System) for storing data, and MapReduce for running computations on that data. Foursquare’s current Hadoop cluster is about 100 servers in our datacenter, with 2.5 Petabytes of raw storage. Hadoop powers a lot of what we do here, from business and product decisions, to some of our coolest features (like the real-time notifications from the all-new Foursquare).

Having our Mongo data in Hadoop allows us to have the exact same view of the world that the app has. It lets us ask massive questions without any impact on the production database.

Getting The Data There

It starts with the Mongo replicas, which are also running in our datacenter. These nodes are all running independent LVM stacks beneath each mongod process (one LVM stack per physical SSD volume). Every six hours a process running on each node issues an LVM snapshot command for the disk on which a given mongo database runs. A central coordinator ensures that this happens as close as possible to simultaneously across all clusters. This creates a “sane” snapshot of the cluster. If shards were snapshotted at very different times, there could be records with foreign keys that don’t exist. These snapshots are archived and compressed locally, then uploaded to a specific directory named according to the snapshot group being taken in HDFS.

A separate process is continuously monitoring this directory, waiting for a complete cluster to be available (i.e., every shard from the cluster exists). Once that happens, the files are downloaded, decompressed, and extracted in parallel across several servers. When an individual file finishes downloading, we launch the mongodump utility to write the data back to HDFS. This data is in a format that’s easier to consume in MapReduce jobs. For example, all our checkin data up to and including 2014-01-01 is in a single directory, stored in Mongo’s BSON format: /datasets/mongo/bson/ident=checkins/version=2014-01-01/col=checkins/

Reading the Data

Having our entire Mongo database available in Hadoop is nice, but it’s not very useful unless it can be read by MapReduce jobs. Every Mongo collection that gets read in Hadoop has an associated Thrift definition, and a custom input format turns the BSON into the Thrift object. Since we use Scala, the Thrift objects are generated Scala code, using Spindle.

When an entire cluster is finished for a particular day, we put a marker file to indicate this to Hadoop jobs. Another process updates our Hive tables to point to the new data, entirely transparent to people writing Hive queries.

– Joe Ennever (@TDJoe), Data Infrastructure Engineer