Mongo on Hadoop

At Foursquare, one of our most important pieces of data infrastructure is the pipeline that copies our production Mongo database into Hadoop. Today, we’re open-sourcing two parts of this pipeline: a utility that dumps Mongo data to Hadoop, and code for reading that data from MapReduce jobs.

Why Mongo on Hadoop?

Our Mongo database is the source of truth for everything at Foursquare. How many users do we have? How many people are checked into JFK right now? What’s the most-liked tip left by a friend at this bar? Ask Mongo. However, the database has to stay extremely responsive to reads and writes at all times; no one wants their check ins to show up ten minutes after they happen! This means that more expensive queries (such as looking at every user joined with every venue they’ve visited) cannot happen on Mongo. So how do we answer these questions? We dump the data to Hadoop!

What is Hadoop?

Hadoop is an open-source computation framework for massive sets of data. It has two main components: HDFS (Hadoop Distributed File System) for storing data, and MapReduce for running computations on that data. Foursquare’s current Hadoop cluster is about 100 servers in our datacenter, with 2.5 Petabytes of raw storage. Hadoop powers a lot of what we do here, from business and product decisions, to some of our coolest features (like the real-time notifications from the all-new Foursquare).

Having our Mongo data in Hadoop allows us to have the exact same view of the world that the app has. It lets us ask massive questions without any impact on the production database.

Getting The Data There

It starts with the Mongo replicas, which also run in our datacenter. Each node runs an independent LVM stack beneath each mongod process (one LVM stack per physical SSD volume). Every six hours, a process on each node issues an LVM snapshot command for the disk on which a given Mongo database runs. A central coordinator ensures that this happens as close to simultaneously as possible across all clusters, producing a “sane” snapshot of each cluster; if shards were snapshotted at very different times, there could be records with foreign keys that don’t exist. These snapshots are archived and compressed locally, then uploaded to a directory in HDFS named for the snapshot group.

A separate process is continuously monitoring this directory, waiting for a complete cluster to be available (i.e., every shard from the cluster exists). Once that happens, the files are downloaded, decompressed, and extracted in parallel across several servers. When an individual file finishes downloading, we launch the mongodump utility to write the data back to HDFS. This data is in a format that’s easier to consume in MapReduce jobs. For example, all our checkin data up to and including 2014-01-01 is in a single directory, stored in Mongo’s BSON format: /datasets/mongo/bson/ident=checkins/version=2014-01-01/col=checkins/
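The layout above can be sketched in a few lines. This is an illustrative Python sketch with helper names of our own invention, not part of the released tools; only the directory convention comes from the post.

```python
import posixpath

# The partitioned BSON layout from the example path in the post.
BSON_ROOT = "/datasets/mongo/bson"

def dataset_path(ident, version, collection):
    """Build the HDFS directory holding one collection's dump for a given day."""
    return posixpath.join(
        BSON_ROOT,
        "ident=%s" % ident,
        "version=%s" % version,
        "col=%s" % collection,
    )

def cluster_complete(uploaded_shards, expected_shards):
    """A snapshot group is only usable once every shard has arrived."""
    return set(expected_shards) <= set(uploaded_shards)

print(dataset_path("checkins", "2014-01-01", "checkins"))
# /datasets/mongo/bson/ident=checkins/version=2014-01-01/col=checkins
```

The completeness check mirrors the "wait for every shard" condition that gates the download-and-dump step.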

Reading the Data

Having our entire Mongo database available in Hadoop is nice, but it’s not very useful unless it can be read by MapReduce jobs. Every Mongo collection that gets read in Hadoop has an associated Thrift definition, and a custom input format turns the BSON into the Thrift object. Since we use Scala, the Thrift objects are generated Scala code, using Spindle.
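The input-format idea (decoded BSON document in, typed record out) looks roughly like this. The sketch is in Python with made-up field names; the real records are Spindle-generated Scala Thrift classes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical typed record standing in for a Spindle-generated Thrift class.
@dataclass
class Checkin:
    user_id: int
    venue_id: str
    shout: Optional[str] = None

def checkin_from_bson(doc):
    """Turn one decoded BSON document (a dict) into a typed record,
    as the custom input format does for each record in a split."""
    return Checkin(
        user_id=doc["uid"],
        venue_id=doc["venueid"],
        shout=doc.get("shout"),  # optional fields may be absent in BSON
    )

record = checkin_from_bson({"uid": 42, "venueid": "jfk", "shout": "boarding!"})
```

Downstream MapReduce code then works with typed fields instead of raw dictionaries.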

When an entire cluster is finished for a particular day, we write a marker file to signal to Hadoop jobs that the data is complete. Another process updates our Hive tables to point at the new data, entirely transparently to people writing Hive queries.

– Joe Ennever (@TDJoe), Data Infrastructure Engineer

The Mathematics of Gamification

At Foursquare, we maintain a database of 60 million venues. And like the world it represents, our database is ever-changing, with users from all over the world submitting updates on everything from the hours of a restaurant to the address of a new barbershop. To maintain the accuracy of our venue database, these changes are voted upon by our loyal Superusers (SUs) who vigilantly maintain a watchful eye over our data for their city or neighborhood.

Like many existing crowd-sourced datasets (Quora, Stack Overflow, Amazon Reviews), we assign users points or votes based on their tenure, reputation, and the actions they take. Superusers like points and gamification: they reward diligent, hard-working SUs (the majority) and punish the few malicious “bad players.” But data scientists like probabilities and guarantees. We’re interested in making statements like, “we are 99% confident that each entry is correct.” How do we allocate points to users in a way that rewards them for good behavior but still allows us to make guarantees about the accuracy of our database?

At Foursquare, we have a simple, first-principles based method of resolving proposed venue attribute updates. We can gauge each Superuser’s voting accuracy based on their performance on honeypots (proposed updates with known answers which are deliberately inserted into the updates queue). Measuring performance and using these probabilities correctly is the key to how we assign points to a Superuser’s vote.

The Math

Let’s make this more concrete with some math. Let $H_0$ denote the true state of the world, either $1$ or $-1$, which we can interpret as a proposed update being true or false, respectively. We do not observe this but we know $H_0 = 1$ with a-priori probability $p_0$. User 1 votes $H_1$ (again, either $1$ or $-1$, representing “yay” or “nay”) with independent probability $p_1$ of agreeing with the truth $H_0$ and $(1-p_1)$ of disagreeing. Bayes’ Rule then gives us
\begin{align*}
\P(H_0 = 1 | H_1 = 1) & = \frac{\P(H_1 = 1 | H_0 = 1) \P(H_0 = 1)}{\P(H_1 = 1)} \\
& = \frac{p_0 p_1}{p_0 p_1 + (1-p_0)(1-p_1)} \\
& = \frac{\ell_0 \ell_1}{\ell_0 \ell_1 + 1}
\end{align*}where we have written the solution in terms of the likelihood ratio $\ell_k = \ell(p_k)$ given by
\[ \ell(p) = \frac{p}{1-p} \qquad \ell^{-1}(\cdot) = \frac{\cdot}{1 + \cdot}\,. \]Then we have that
\[ \P(H_0 = 1 | H_1 = 1) = \ell^{-1}(\ell(p_0) \ell(p_1))\,. \]In fact, it is easy to see that in the general case,
\[ \P(H_0 = 1 | H_1 = h_1) = \ell^{-1}(\ell(p_0) \ell(p_1)^{h_1})\,. \]Multiplication is hard so we will define the logit or log-likelihood function
\[ \logit: (0,1) \to (-\infty, \infty) \]given by
\[ \logit(p) = \log(\ell(p)) = \log\prn{\frac{p}{1-p}} \qquad \logit^{-1}(\cdot) = \frac{e^{\cdot}}{1 + e^{\cdot}}\,. \]Then we have
\[ \P(H_0 = 1| H_1 = h_1) = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1))\,. \]
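As a quick numerical sanity check (the probabilities below are example values of our own, not from the post), the logit form agrees with applying Bayes’ rule directly:

```python
import math

# The likelihood-ratio / logit machinery from the derivation above.
def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return math.exp(x) / (1.0 + math.exp(x))

# Example values: prior p0 and user 1's accuracy p1.
p0, p1 = 0.5, 0.9

# Direct Bayes' rule for P(H0 = 1 | H1 = 1)...
direct = (p0 * p1) / (p0 * p1 + (1 - p0) * (1 - p1))
# ...and the same posterior via logits, with h_1 = +1 ("yay").
via_logit = inv_logit(logit(p0) + 1 * logit(p1))

print(direct, via_logit)  # agree to floating-point precision
```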

Continuing, assume that after user 1 casts their vote, user 2 votes $H_2$ with an independent probability $p_2$ of being correct (i.e. agreeing with $H_0$). We can think of the posterior probability $\P(H_0 = 1| H_1 = h_1)$ as our new prior and inductively repeat the above Bayesian analysis to obtain
\begin{align*}
\P(H_0 = 1 | H_1 = h_1, H_2 = h_2) & = \logit^{-1}\prn{\logit\prn{\P(H_0 = 1 | H_1 = h_1)} + h_2 \logit(p_2)} \\
& = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1) + h_2 \logit(p_2))\,.
\end{align*}In fact, if we have $n$ votes $H_1, \ldots, H_n$, then we have
\begin{align} & \P(H_0 = 1| H_1 = h_1, \ldots, H_n = h_n) \nonumber\\
& \qquad\qquad = \logit^{-1}\prn{ \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) } \,. \label{eq:main}
\end{align}

The Solution

The above equation suggests that we should assign $s_k$ points or votes to user $k$ based on \begin{equation} s_k = \logit(p_k)\,. \label{eq:points} \end{equation} We can add up all the “yay” votes and subtract all the “nay” votes to obtain a score for the update, and this score can be interpreted directly as a probability that the update is correct. We can set a certainty threshold $p$ (e.g. $p = 99\%$) for the desired accuracy of an edit. Then, we accept a proposed edit as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \ge \logit(p) \label{eq:upper} \end{equation} and reject it as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \le -\logit(p)\,. \label{eq:lower} \end{equation}

In other words, if we take $t = \logit(p)$ to be the points threshold and $s_0 = \logit(p_0)$ to be the points allocated to a new proposed edit, then \eqref{eq:upper} and \eqref{eq:lower} become
\[ s_0 + \sum_{k=1}^n h_k s_k \ge t \]and
\[ s_0 + \sum_{k=1}^n h_k s_k \le -t\,, \]which are exactly the equations for voting you would expect. But now, they’re derived from math!
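The whole voting rule fits in a few lines. This is a minimal sketch of the rule as derived above (accuracies and thresholds are example values, not Foursquare's production parameters):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def tally(p0, votes, p_threshold=0.99):
    """Accept once the accumulated log-likelihood score reaches
    +logit(p), reject once it reaches -logit(p), otherwise keep
    collecting votes. `votes` is a list of (h_k, p_k) pairs with
    h_k in {+1, -1} and p_k the voter's measured accuracy."""
    t = logit(p_threshold)
    score = logit(p0)  # points allocated to the new proposed edit
    for h, p in votes:
        score += h * logit(p)  # each voter contributes +/- logit(p_k)
        if score >= t:
            return "accept"
        if score <= -t:
            return "reject"
    return "undecided"

# Two accurate "yay" voters push an even prior past the 99% bar.
print(tally(0.5, [(+1, 0.95), (+1, 0.9)]))
```

Note that a single 90%-accurate voter is not enough to clear a 99% threshold from an even prior; the process keeps soliciting votes until the certainty guarantee is met.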

The Benefits

  • Efficient, data-driven guarantees about database accuracy. By choosing the points based on a user’s accuracy, we can intelligently accrue certainty about a proposed update and stop the voting process as soon as the math guarantees the required certainty.
  • Still using points, just smart about calculating them. By relating a user’s accuracy and the certainty threshold needed to accept a proposed update to an additive point system \eqref{eq:points}, we can still give a user the points that they like. This also makes it easy to take a system of ad-hoc points and convert it over to a smarter system based on empirical evidence.
  • Scalable and easily extensible. The parameters are automatically trained and can adapt to changes in the behavior of the userbase. No more long meetings debating how many points to grant to a narrow use case.
    So far, we’ve taken a very user-centric view of $p_k$ (this is the accuracy of user $k$). But we can go well beyond that. For example, $p_k$ could be “the accuracy of user $k$’s vote given that they have been to the venue three times before and work nearby.” These clauses can be arbitrarily complicated and estimated from a (logistic) regression of the honeypot performance. The point is that these changes will be based on data and not subjective judgments of how many “points” a user or situation should get.

Some practical considerations:

  • In practice, we might want a different threshold for accepting \eqref{eq:upper} versus rejecting \eqref{eq:lower} a proposed edit.
  • For notational simplicity, we have assumed that false positives and false negatives in user $k$’s voting have the same probability $p_k$. In general, this is not the case. We leave it to the reader to work out the math of the general case.
  • Users like integer points. We have to round $s_k$ to the nearest integer. Because we can multiply linear equations like \eqref{eq:upper} and \eqref{eq:lower} by a positive constant, we can set $s_k = [\alpha \cdot \logit(p_k)]$ where $[\cdot]$ is the rounding function and $\alpha$ is a large positive constant. A large $\alpha$ will prevent the loss of fidelity.
  • We’ve explained how to obtain $p_1, p_2, \ldots$ from honeypots, but how do we obtain $p_0$, the accuracy of newly proposed updates? One way is to use the above to bootstrap those accuracies from voting: we can use this voting technique to infer the accuracy of proposals by looking at what fraction of proposed updates are accepted!
  • Bayesian Smoothing. We assume a relatively low-accuracy prior for the accuracy of individuals. This is a pessimistic assumption that keeps new, untested users from having too much influence. It also rewards users for lending their judgment and casting votes as long as those are more accurate than our pessimistic prior. Of course, we also increase the likelihood of showing new Super Users honeypots to give them a chance to prove themselves.
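The integer-points consideration above can be sketched directly ($\alpha = 100$ here is an arbitrary illustrative choice, not a value from the post):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# s_k = [alpha * logit(p_k)]: scale by a large constant, then round.
# A larger alpha loses less fidelity to rounding, since the relative
# rounding error shrinks as the scores grow.
def points(p, alpha=100):
    return round(alpha * logit(p))

for p in (0.5, 0.9, 0.99):
    print(p, points(p))
```

A 50%-accurate voter earns 0 points (their vote carries no information), while more accurate voters earn proportionally more.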

–Michael Li
Data Scientist

Foursquare’s new notifications and the future of contextual mobile experiences

For the last year I’ve been obsessed with a new breed of mobile applications that are aware of a user’s context: who they are, where they are in the world, and what is going on around them. Apps like Dark Sky, Google Now, and Square Wallet are starting to enable amazing new real-world experiences that make users feel like they have superpowers by connecting them seamlessly to information.

Last month, we launched the new Foursquare notifications which automatically lets people know about the best dishes on the menu when they walk into a restaurant, or the top spots not to miss when they land in a new city.  In this talk at Data Driven NYC, I explain how we built this exciting new product from the data exhaust of millions of mobile devices, and how it sets the groundwork for an exciting new world of highly-targeted contextual experiences.

Data Driven NYC 20 // Blake Shaw of Foursquare from Matt Turck on Vimeo.

- Blake (@metablake)

A chat about data science and our fun visualizations

A little while back, I gave a talk on a Big Data Panel at the Stanford Graduate School of Business’s China 2.0 conference.  We had a great discussion about the uses of data science and the fun visualizations we do with our data at Foursquare. Check it out: 


How we built our Model Training Engine

At Foursquare, we have large-scale machine-learning problems. From choosing which venue a user is trying to check in at based on a noisy GPS signal, to serving personalized recommendations, discounts, and promoted updates to users based on where they or their friends have been, almost every aspect of the app uses machine learning in some way. All of these queries happen at a massive scale: we average one million Explore queries and six million check-ins every day. Not only do we have to process each request faster than the blink of an eye, but these millions of user interactions give us millions of data points to feed back into our models to make them better. We’ve been building out a Model Training Engine (MTE) to automate our (machine) learning from user data. Here’s an overview to whet your appetite.

Fitting the model to the data rather than the data to the model.

Many models are built using linear regressions or similar approaches. While these models can help us quickly understand data (and we certainly make use of them), they make convenient but unrealistic assumptions and are limited in the kinds of relationships they can express. The MTE uses techniques like Boosted Decision Trees and Random Forests (we have both a scikit-learn and an in-house MapReduce-based implementation) to learn much more detailed and nuanced models that fit the data better.

Keeping models fresh and relevant.

With 6 million new check-ins a day, models quickly get stale. The MTE automatically retrains models daily based on the latest signals and the latest data. New signals and changes in old signals are immediately incorporated into new models and we monitor and deploy newer models when they outperform older ones.

Model training that scales with data and the organization.

With a large-scale, very interconnected system, changes made by other engineers on a seemingly unrelated app feature could throw off a very carefully calibrated model. How do we make model building scale across billions of check-ins and an entire organization without engineers stepping on each other’s toes?

To make models scalable across our data, we’ve rolled our own online learning algorithms and use clever downsampling techniques when we cannot load the entire dataset into memory. We use techniques like bagging and cross-validation to optimally understand how to combine different signals into a single prediction in a way that maximizes the contribution from each signal without picking up on spurious correlations (aka overfitting). This means that no one can throw off the model by adding or tweaking a signal. For example, if an engineer accidentally adds random noise (e.g. dice rolls) as a signal, the MTE would quickly detect that the signal was not predictive and ignore it. This allows us to be open to new ideas and signals from pretty much anyone at the company, not just data scientists.
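The "dice rolls get ignored" behavior can be illustrated with a toy sketch. This is our own simplification, not the MTE's actual validation code: score each signal by how well a trivial threshold rule on it predicts the label, and a pure-noise signal scores no better than chance.

```python
import random

def signal_accuracy(signal, labels):
    """Accuracy of the trivial rule 'predict positive when the signal
    is above its median' -- a crude stand-in for held-out validation."""
    cut = sorted(signal)[len(signal) // 2]
    hits = sum((s > cut) == y for s, y in zip(signal, labels))
    return hits / len(labels)

random.seed(0)
labels = [random.random() < 0.5 for _ in range(2000)]
useful = [y + random.gauss(0, 0.3) for y in labels]  # tracks the label
noise = [random.random() for _ in labels]            # dice rolls

print(signal_accuracy(useful, labels))  # well above chance
print(signal_accuracy(noise, labels))   # hovers around 0.5
```

A real system would use proper cross-validated model performance rather than a single threshold rule, but the principle is the same: signals that don't move held-out accuracy get no weight.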

What’s more, the MTE can adapt to frequent UX and other product changes, all without human intervention. For example, if our mobile team changes the UI to make friends’ prior visits more prominent, the MTE will automatically detect that users are weighing social signals more heavily and adjust our models accordingly. And our automated Model Training Engine means that engineers can concentrate on building signals and let the model training select their best ones.

All of these quality improvements are translating into a better and smarter user experience. More details (with code) and quality improvements to come!

–Michael Li, Data Scientist

Foursquare Native Auth on iOS and Android: Developers, connect your users more quickly than ever

A few weeks ago we were excited to announce one of our developer community’s most-wished-for features: native authentication for iOS. Today, we’re happy to announce that we’ve also shipped support for native auth on Android in our latest release of Foursquare on Google Play! In a nutshell, this means that your users can connect their Foursquare accounts to your app without wrangling with messy WebViews and log-ins. Native authentication simply pops your users into the Foursquare app on their phone and lets them use their existing credentials there.

And even though this has only been out for a few short weeks, we love what our developers have been doing with it so far. If you want to see what native auth looks and feels like in the wild, install the latest version of quick check-in app Checkie: after using Foursquare to find a place for you and your friends to go, Checkie lets you check in with incredible speed.

Since Checkie uses our checkins/add endpoint, users need a way to log in. Below is what the app used to look like upon opening: users were taken directly to a WebView, where they had to type in—and, more importantly, remember, without the aid of Facebook Connect—their Foursquare credentials before continuing to use Checkie.

For this old flow to succeed, at least four taps are necessary, along with who knows how many keystrokes. Below is how the new Checkie flow works after integrating native auth: there’s a more informational screen when the app opens, and only two taps are necessary to begin actually using Checkie: “Sign in,” which bumps users to the Foursquare app where they can hit “Allow.”

How You Can Use Native Auth Today

You too can get started using this flow right away. We have libraries and sample code for iOS and Android available on GitHub that you can dive straight into. The details vary depending on OS, but the overall conceptual process is similar for both and outlined below—it should be familiar for those who have worked with 3-legged OAuth before.

  1. Update your app’s settings. You need to modify your app’s redirect URIs (iOS) or add a key hash (Android).

  2. Include our new libraries in your project. OS-specific instructions are found on their GitHub pages.

  3. Unless you want to use it as a backup mechanism, get rid of that (UI)WebView! Chances are, if you expect your users to have Foursquare accounts, they’ll have the app on their phones.

  4. Call our new native authorize methods. On iOS, it’s authorizeUserUsingClientId; on Android, it’s FoursquareOAuth.getConnectIntent then startActivityForResult with the returned intent. These methods bounce your users to the Foursquare app’s authorize screen or return appropriate fallback responses allowing them to download the app.

  5. If your user authorizes your app, they will land back in your app. Follow the OS-specific instructions to obtain an access code. This should involve calling either accessCodeForFSOAuthURL (iOS) or FoursquareOAuth.getAuthCodeFromResult (Android).

  6. Trade this access code for an access token. The access token (not access code) is what is eventually used to make calls on behalf of a particular user. There are two ways to do this:

    1. (Preferred) Pass the access code to your server, and then make a server-side call—see step 3 under our code flow docs for details on the exact parameters needed. The response from Foursquare will be an access token, which can be saved and should be used to make auth’d requests. This method is preferable because it avoids including your client secret in your app. For more details, see our page on connecting.

    2. Call our new native methods to get an access token. On iOS it’s requestAccessTokenForCode; on Android it’s FSOauth.getTokenExchangeIntent followed by startActivityForResult (make sure you also make the requisite changes to AndroidManifest.xml).
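The server-side half of step 6 boils down to one HTTP request. The sketch below only builds the request URL; the endpoint and parameter names follow the standard OAuth 2.0 authorization-code flow, so confirm the exact values against the Foursquare connect documentation before relying on them.

```python
from urllib.parse import urlencode

def token_exchange_url(client_id, client_secret, redirect_uri, code):
    """Build the URL for trading an access code for an access token.
    The client secret belongs on your server, never in the app itself."""
    params = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "authorization_code",
        "redirect_uri": redirect_uri,
        "code": code,
    })
    return "https://foursquare.com/oauth2/access_token?" + params

# Placeholder values for illustration only.
url = token_exchange_url("CLIENT_ID", "CLIENT_SECRET",
                         "https://example.com/callback", "ACCESS_CODE")
```

Your server would GET this URL and persist the access token from the JSON response for subsequent auth’d API calls.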

If you have any comments or questions about this new native auth flow—or anything API-related in general!—please reach out to

David Hu, Developer Advocate

Machine learning at Foursquare

In March, I spoke at Queens Open Tech about machine learning at Foursquare. The talk gives a nice overview of the kinds of insights we have about human behavior from check-in data and our machine-learning setup. Learn how we used smarter algorithms to get 20,000 people to try a new place every week.

- Michael Li, Data Scientist at Foursquare

Quattroshapes: A Global Polygon Gazetteer from Foursquare

Foursquare’s geographic infrastructure relies on numerous pieces of open geo software: PostGIS, GDAL, Shapely, Fiona, QGIS, S2, and JTS, as well as open geographic data: OSM, the US Census’ TIGER, Canada’s geogratis, Mexico’s INEGI, and EuroGeoGraphics, to name a few. We’ve been inspired by existing efforts around geographic data, including the alphashapes and betashapes projects. We are eager and excited to contribute back to the open geo ecosystem with a few projects that I demoed recently at foss4g-na and State of the Map US.



Geographic polygon / boundary data is important to us as a way to aggregate venues around places like cities and neighborhoods. Finding a good source of city data around the world has proved difficult. For that reason, we’ve been curating a set of worldwide polygon data that we’re calling Quattroshapes. Quattroshapes debuted at Nathaniel Vaughn Kelso’s talk at State of the Map US this past weekend. The project combines normalizing open government data with synthesizing new polygons out of flickr photos and Foursquare check-in data in places where open government data is unavailable. It’s called quattroshapes because it’s the fourth iteration (that we know of) of the work flickr did on alphashapes and SimpleGeo did on betashapes; also, it’s based on a quadtree.

We use this polygon data in twofishes, our coarse, splitting, forward and reverse geocoder based on the dataset. Twofishes has been open source since we first wrote it, but we recently started releasing prebuilt indexes, complete with autocomplete and partial worldwide city-level reverse geocoding functionality. Twofishes is used in Foursquare Explore on the web, and we’re looking at using it with our mobile applications as well to provide the best experience to our users. We’re also proud to say that our friends at Twitter have found a use for it.


We’re eager to collaborate with others on continuing to source and create this data. If you know of open (redistributable, commercial-friendly) datasets that we’ve missed, please let us know. If you have large sources of labeled point data that you think could help create more accurate inferred polygons, we’re interested in that too. If you make use of the quattroshapes or twofishes project, we’d love to hear how you’re using it and how it’s working out for you.

David Blackman, Geo Lead at Foursquare

Load tests for the real world

The gold standard for systems performance measurement is a load test, which is a deterministic process of putting a demand on a system to establish its capacity. For example, you might load test a web search cluster by playing back actual logged user requests at a controlled rate. Load tests make great benchmarks for performance tuning exactly because they are deterministic and repeatable. Unfortunately, they just don’t work for some of us.

At Foursquare, we push new versions of our application code at master/HEAD to production at least daily. We are constantly adding features, tweaking how old features work, doing A/B tests on experimental features, and doing behind-the-scenes work like refactoring and optimization to boot. So any load test we might create would have to be constantly updated to keep up with new features and new code. This hypothetical situation is reminiscent of bad unit tests that basically repeat the code being tested — duplicated effort for dubious gain.

To make things even worse, a lot of our features rely on a lot of data. For example, to surface insights after you check in to a location on Foursquare we have to consider all your previous check-ins, your friends’ check-ins, popular tips at the venue, nearby venues that are popular right now, and so on. Creating an environment in which we might run a meaningful load test would require us to duplicate a lot of data, maybe as much as the whole site. A lot of data means a lot of RAM to serve it from, and RAM is expensive.

So we usually choose not to attempt these “canned” load tests. In lieu of a classic load test, our go-to pre-launch performance test is what we call a “dark test.” A dark test involves generating extra work in the system in response to actual requests from users.

For example, in June 2012, we rolled out a major Foursquare redesign in which we switched the main view of the app from a simple list of recent friend check-ins to an activity stream which included other types of content like tips and likes. Behind the scenes, the activity stream implementation was much more complex than the old check-in list. This was in part because we wanted to support advanced behavior like collapsing (your friend just added 50 tips to her to-do list, we should collapse them all into a single stream item).


Before and after the redesign

Perhaps surprisingly, the biggest driver of additional complexity was the requirement for infinite scroll, which meant we needed to be ready to materialize any range of activity for all users. Since the intention was for the activity stream to be the main view a user sees upon opening the Foursquare app, we knew that the activity stream API endpoint would receive many, many requests as soon as users started to download and use the new version of the app. Above all, we did not want to make a big fuss about this great new feature and then give our users a bad experience by serving errors to them when they tried to use it. Dark testing was a key factor in making the launch a success.

The first version of the dark test was very simple: whenever a Foursquare client makes a request for the recent check-ins list, generate an activity stream response in parallel with the recent check-ins response, then throw the activity stream response away. We then hooked this up to a runtime control in our application which permitted it to be invoked on an arbitrary percentage of requests, so we were able to generate this work for one percent, five percent, 20 percent, etc. of all check-in list requests. By the time we were a few weeks out from redesign launch, we were running this test 24/7 for one hundred percent of requests, which gave us pretty good confidence that we could launch this feature without overloading our systems.
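The shape of that runtime control is simple enough to sketch. All names below are our own stand-ins, not Foursquare's actual handlers: the real response is served as usual, and on a configurable fraction of requests the expensive new code path runs in parallel and its result is discarded.

```python
import random

def should_dark_test(percent, rng=random):
    """Return True for roughly `percent` out of every 100 calls."""
    return rng.random() * 100.0 < percent

def serve_recent_checkins(request):
    return {"checkins": []}  # stand-in for the real check-in list endpoint

def build_activity_stream(request):
    return {"stream": []}    # stand-in for the expensive new code path

def handle_recent_checkins(request, dark_percent):
    response = serve_recent_checkins(request)  # what the user actually sees
    if should_dark_test(dark_percent):
        build_activity_stream(request)         # extra work, result thrown away
    return response
```

Ramping `dark_percent` from 1 to 100 over a few weeks reproduces the rollout described above, with real traffic shapes and real data at every step.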


– Cooper Bethea (@cooperb)

Native app integration like never before: The Foursquare for BlackBerry 10 SDK

Yesterday, BlackBerry announced their first BB10 devices. We on the BlackBerry team at Foursquare are really excited about the launch and wanted to give our awesome third-party developers something to help them get the most out of Foursquare and BlackBerry 10. With the help of the amazing Invocation Framework (learn more here), we have opened up a few parts of the Foursquare for BlackBerry 10 app so that developers can enrich their native apps with Foursquare content more easily than ever before. Check out the two examples below, then head over to GitHub to check out our sample app and get started.

Foursquare Single Sign On (SSO)
The first thing we’ve opened up, and a personal favorite, is the ability for your users to connect their Foursquare accounts with the click of a button. Instead of every app having to build its own WebView wrapper around the OAuth flow for obtaining an access token, we’ve built the flow right into the native Foursquare app for everyone to use. You can now let a user log in to your app through Foursquare in 6 lines of code, and the user never has to leave the context of your app.

It is up to the user whether to approve or deny your app. We will send their action back to you, along with the access token if they decided to link their Foursquare account with your app.


Easy as that! So be sure to include a “connect with Foursquare” option in your app to reduce friction for new users signing up!

Foursquare Place Picker
More and more often, the most engaging content users create comes with a location attached, whether that’s a picture posted on Instagram or a beer being checked into on Untappd. With the place picker API, you can build this rich content into your app, backed by the power of the over 50 million places in the Foursquare database. The best part is that, just like the SSO API, you get the native Foursquare UI, network requests, and GPS functionality built into your app without having to write any of it; just use the Invocation Framework to launch it. If you already know what your user is looking for, you can pass in a query to prime the search, and if you have already authenticated a user, just pass in their token for personalized results!

Once a user selects a place, we’ll return to you the JSON data for that place, which you can process and use however you need!


– Kyle, Foursquare BlackBerry Engineer