A chat about data science and our fun visualizations

A little while back, I gave a talk on a Big Data Panel at the Stanford Graduate School of Business’s China 2.0 conference.  We had a great discussion about the uses of data science and the fun visualizations we do with our data at Foursquare. Check it out: 


How we built our Model Training Engine

At Foursquare, we have large-scale machine-learning problems. From choosing which venue a user is trying to check in at based on a noisy GPS signal, to serving personalized recommendations, discounts, and promoted updates to users based on where they or their friends have been, almost every aspect of the app uses machine-learning in some way.  All of these queries happen at a massive scale: we average one million Explore queries and six million check-ins every day. Not only do we have to process each request faster than the blink of an eye, but these millions of user interactions are giving us millions of data points to feed back into our models to make them better. We’ve been building out a Model Training Engine (MTE) to automate our (machine) learning from user data.  Here’s an overview to whet your appetite.

Fitting the model to the data rather than the data to the model.

Many models are built using linear regressions or similar approaches. While these models can help us quickly understand data (and we certainly make use of them), they make convenient but unrealistic assumptions and are limited in the kinds of relationships they can express. The MTE uses techniques like Boosted Decision Trees or Random Forests (we have both a scikit-learn and an in-house MapReduce-based implementation) to learn much more detailed and nuanced models that fit the data better.

Keeping models fresh and relevant.

With 6 million new check-ins a day, models quickly get stale. The MTE automatically retrains models daily based on the latest signals and the latest data. New signals and changes in old signals are immediately incorporated into new models and we monitor and deploy newer models when they outperform older ones.
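The "deploy when they outperform" rule can be illustrated with a minimal sketch. This is not the MTE's actual code: the model representation (a function from a signal value to a label), the `accuracy` metric, and the `min_gain` parameter are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]      # (signal value, label); a hypothetical format
Model = Callable[[float], int]   # maps a signal value to a predicted label


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    if not examples:
        return 0.0
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


def choose_model(current: Model, candidate: Model,
                 holdout: Sequence[Example], min_gain: float = 0.0) -> Model:
    """Keep the current model unless the retrained one beats it on held-out data."""
    if accuracy(candidate, holdout) > accuracy(current, holdout) + min_gain:
        return candidate
    return current


# Yesterday's model splits at 0.8; today's retrained model splits at 0.5.
def old_model(x: float) -> int:
    return int(x > 0.8)


def new_model(x: float) -> int:
    return int(x > 0.5)


holdout = [(0.1, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
deployed = choose_model(old_model, new_model, holdout)
```

Comparing both models on the same held-out examples, rather than on the data they were trained on, is what keeps a freshly retrained model from winning just because it memorized the latest data.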

Model training that scales with data and the organization.

With a large-scale, very interconnected system, changes made by other engineers on a seemingly unrelated app feature could throw off a very carefully calibrated model. How do we make model building scale across billions of check-ins and an entire organization without engineers stepping on each other’s toes?

To make models scalable across our data, we’ve rolled our own online learning algorithms and use clever downsampling techniques when we cannot load the entire dataset into memory. We use techniques like bagging and cross-validation to understand how best to combine different signals into a single prediction in a way that maximizes the contribution from each signal without picking up on spurious correlations (aka overfitting). This means that no one can throw off the model by adding or tweaking a signal. For example, if an engineer accidentally adds random noise (e.g. dice rolls) as a signal, the MTE would quickly detect that the signal was not predictive and ignore it. This allows us to be open to new ideas and signals from pretty much anyone at the company, not just data scientists.
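To see why held-out evaluation protects against a "dice roll" signal, here is a small self-contained sketch. It is a deliberately simplified stand-in, not the MTE's method: it uses k-fold cross-validation with a one-threshold decision stump per signal (no bagging), and the signal names `visits` and `dice`, the synthetic data, and the 0.1 margin are all made up for illustration.

```python
import random
from typing import List, Sequence, Tuple

Stump = Tuple[float, int]                     # (threshold, polarity)
THRESHOLDS = [i / 10 for i in range(1, 10)]   # candidate split points 0.1 .. 0.9


def predict(stump: Stump, x: float) -> int:
    """Polarity 1 predicts 1 above the threshold; polarity 0 predicts 1 below it."""
    threshold, polarity = stump
    above = int(x > threshold)
    return above if polarity == 1 else 1 - above


def fit_stump(xs: Sequence[float], ys: Sequence[int]) -> Stump:
    """Pick the (threshold, polarity) pair with the best training accuracy."""
    best, best_correct = (THRESHOLDS[0], 1), -1
    for threshold in THRESHOLDS:
        for polarity in (1, 0):
            stump = (threshold, polarity)
            correct = sum(1 for x, y in zip(xs, ys) if predict(stump, x) == y)
            if correct > best_correct:
                best, best_correct = stump, correct
    return best


def cv_accuracy(xs: Sequence[float], ys: Sequence[int], folds: int = 5) -> float:
    """Held-out accuracy of a stump, pooled over k folds."""
    n = len(xs)
    correct = 0
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        stump = fit_stump(train_x, train_y)
        correct += sum(1 for i in test_idx if predict(stump, xs[i]) == ys[i])
    return correct / n


rng = random.Random(0)
n = 400
visits = [rng.random() for _ in range(n)]   # hypothetical informative signal
dice = [rng.random() for _ in range(n)]     # pure noise, unrelated to the label
labels = [int(v > 0.5) for v in visits]

# Always guessing the majority class is the bar a useful signal must clear.
baseline = max(sum(labels), n - sum(labels)) / n
kept: List[str] = [
    name for name, xs in (("visits", visits), ("dice", dice))
    if cv_accuracy(xs, labels) > baseline + 0.1
]
```

A stump fit to the noise signal can look slightly better than chance on its own training fold, but that edge does not carry over to the held-out fold, so the signal fails to clear the baseline and is dropped.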

What’s more, the MTE can adapt to frequent UX and other product changes, all without human intervention. For example, if our mobile team changes the UI to make friends’ prior visits more prominent, the MTE will automatically detect that users are weighing social signals more heavily and adjust our models accordingly. And our automated Model Training Engine means that engineers can concentrate on building signals and let the model training select their best ones.

All of these quality improvements are translating into a better and smarter user experience. More details (with code) and quality improvements to come!

–Michael Li, Data Scientist

Foursquare Native Auth on iOS and Android: Developers, connect your users more quickly than ever

A few weeks ago we were excited to announce one of our most-wished-for features from our developer community, native authentication for iOS, and today we’re happy to announce we’ve also shipped support for native auth on Android in our latest release of Foursquare on Google Play! In a nutshell, this means that your users can connect their Foursquare accounts to your app without wrangling with messy WebViews and log-ins. Native authentication simply pops your users into the Foursquare app on their phone and lets them use their existing credentials there.

And even though this has only been out for a few short weeks, we love what our developers have been doing with it so far. If you want to see what native auth looks and feels like in the wild, install the latest version of quick check-in app Checkie: after using Foursquare to find a place for you and your friends to go, Checkie lets you check in with incredible speed.

Since Checkie uses our checkins/add endpoint, users need a way to log in. Below is what the app used to look like upon opening. Users were taken directly to a WebView where they had to type in (and, more importantly, remember, without the aid of Facebook Connect) their Foursquare credentials before continuing to use Checkie.

For this old flow to succeed, at least four taps are necessary, along with who knows how many keystrokes. Below is how the new Checkie flow works after integrating native auth: there’s a more informational screen when the app opens, and only two taps are necessary to begin actually using Checkie: “Sign in,” which bumps users to the Foursquare app where they can hit “Allow.”

How You Can Use Native Auth Today

You too can get started using this flow right away. We have libraries and sample code for iOS and Android available on GitHub that you can dive straight into. The details vary depending on OS, but the overall conceptual process is similar for both and outlined below—it should be familiar for those who have worked with 3-legged OAuth before.

  1. Update your app’s settings. You need to modify your app’s redirect URIs (iOS) or add a key hash (Android).

  2. Include our new libraries in your project. OS-specific instructions are found on their GitHub pages.

  3. Unless you want to use it as a backup mechanism, get rid of that (UI)WebView! Chances are, if you expect your users to have Foursquare accounts, they’ll have the app on their phones.

  4. Call our new native authorize methods. On iOS, it’s authorizeUserUsingClientId; on Android, it’s FoursquareOAuth.getConnectIntent then startActivityForResult with the returned intent. These methods bounce your users to the Foursquare app’s authorize screen or return appropriate fallback responses allowing them to download the app.

  5. If your user authorizes your app, they will land back in your app. Follow OS-specific instructions to obtain an access code. This should involve calling either accessCodeForFSOAuthURL (iOS) or FoursquareOAuth.getAuthCodeFromResult (Android).

  6. Trade this access code for an access token. The access token (not access code) is what is eventually used to make calls on behalf of a particular user. There are two ways to do this:

    1. (Preferred) Pass the access code to your server, and then make a server-side call to https://foursquare.com/oauth2/access_token; see step 3 under our code flow docs for details on the exact parameters needed. The response from Foursquare will be an access token, which can be saved and should be used to make auth’d requests. This method is preferable because it avoids including your client secret in your app. For more details, see our page on connecting.

    2. Call our new native methods to get an access token. On iOS it’s requestAccessTokenForCode. On Android it’s FoursquareOAuth.getTokenExchangeIntent followed by startActivityForResult (make sure you also make the requisite changes to AndroidManifest.xml).
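For the preferred server-side exchange in step 6, a rough sketch of what the server does with the access code might look like the following. Only the endpoint URL comes from the steps above; the parameter names follow the standard OAuth 2.0 authorization-code grant and the `access_token` response field is likewise assumed, so confirm both against the code flow docs. The function names are hypothetical, and the actual HTTP call is left out.

```python
import json
from urllib.parse import urlencode

TOKEN_ENDPOINT = "https://foursquare.com/oauth2/access_token"


def build_token_request_url(client_id: str, client_secret: str,
                            redirect_uri: str, code: str) -> str:
    """Build the server-side request that trades an access code for a token.

    Parameter names are assumed from the standard OAuth 2.0 code grant.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,   # stays on the server, never in the app
        "grant_type": "authorization_code",
        "redirect_uri": redirect_uri,
        "code": code,
    }
    return TOKEN_ENDPOINT + "?" + urlencode(params)


def parse_token_response(body: str) -> str:
    """Extract the access token from a JSON body (field name assumed)."""
    return json.loads(body)["access_token"]


# Placeholder values; the code is whatever your app received in step 5.
url = build_token_request_url("CLIENT_ID", "CLIENT_SECRET",
                              "myapp://callback", "CODE_FROM_APP")
token = parse_token_response('{"access_token": "EXAMPLE_TOKEN"}')
```

Keeping this exchange on the server is the point of the preferred option: the client secret appears only in server-side code, and the app only ever handles the short-lived access code.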

If you have any comments or questions about this new native auth flow—or anything API-related in general!—please reach out to api@foursquare.com.

David Hu, Developer Advocate

Machine learning at Foursquare

In March, I spoke at Queens Open Tech about machine learning at Foursquare. The talk gives a nice overview of the kinds of insights we have about human behavior from check-in data and our machine-learning setup. Learn how we used smarter algorithms to get 20,000 people to try a new place every week.

- Michael Li, Data Scientist at Foursquare

Quattroshapes: A Global Polygon Gazetteer from Foursquare

Foursquare geographic infrastructure relies on numerous pieces of open geo software: PostGIS, GDAL, Shapely, Fiona, QGIS, S2, and JTS as well as open geographic data: OSM, geonames.org, US Census’ TIGER, Canada’s geogratis, Mexico’s INEGI and EuroGeoGraphics to name a few. We’ve been inspired by existing efforts around geographic data including the alphashapes and betashapes projects. We are eager and excited to contribute back to the open geo ecosystem with a few projects that I demoed recently at foss4g-na and State of the Map US.



Geographic polygon / boundary data is important to us as a way to aggregate venues around places like cities and neighborhoods. Finding a good source of city data around the world has proved difficult. For that reason, we’ve been curating a set of worldwide polygon data that we’re calling Quattroshapes. Quattroshapes debuted at Nathaniel Vaughn Kelso’s talk at State of the Map US this past weekend. The project combines normalizing open government data with synthesizing new polygons out of Flickr photos and Foursquare check-in data in places where open government data is unavailable. It’s called Quattroshapes because it’s the fourth iteration (that we know of) of the work Flickr did on alphashapes and SimpleGeo did on betashapes; it’s also based on a quadtree.

We use this polygon data in twofishes, our coarse, splitting, forward and reverse geocoder based on the geonames.org dataset. Twofishes has been open source since we first wrote it, and we are now releasing prebuilt indexes, complete with autocomplete and partial worldwide city-level reverse geocoding functionality. Twofishes is used in Foursquare Explore on the web. We’re looking at using it with our mobile applications as well to provide the best experience to our users. We’re also proud to say that our friends at Twitter have found a use for it as well.


We’re eager to collaborate with others on continuing to source and create this data. If you know of open (redistributable, commercial-friendly) datasets that we’ve missed, please let us know. If you have large sources of labeled point data that you think could help create more accurate inferred polygons, we’re interested in that too. If you make use of the quattroshapes or twofishes project, we’d love to hear how you’re using it and how it’s working out for you.

David Blackman, Geo Lead at Foursquare

Load tests for the real world

The gold standard for systems performance measurement is a load test, which is a deterministic process of putting a demand on a system to establish its capacity. For example, you might load test a web search cluster by playing back actual logged user requests at a controlled rate. Load tests make great benchmarks for performance tuning exactly because they are deterministic and repeatable. Unfortunately, they just don’t work for some of us.
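As a concrete picture of "playing back actual logged user requests at a controlled rate," here is a minimal sketch. The `replay` function, its parameters, and the sample log lines are hypothetical; a real harness would issue HTTP requests and record latencies and errors.

```python
import time
from typing import Callable, Iterable, List


def replay(requests: Iterable[str], send: Callable[[str], None],
           rate_per_sec: float,
           clock: Callable[[], float] = time.monotonic,
           sleep: Callable[[float], None] = time.sleep) -> int:
    """Send logged requests in order at a fixed rate; return how many were sent."""
    interval = 1.0 / rate_per_sec
    start = clock()
    sent = 0
    for i, request in enumerate(requests):
        # Schedule against the start time so slow sends do not accumulate drift.
        wait = start + i * interval - clock()
        if wait > 0:
            sleep(wait)
        send(request)
        sent += 1
    return sent


# Made-up log lines; `handled.append` stands in for the system under test.
logged = ["GET /search?q=coffee", "GET /search?q=pizza", "GET /search?q=tacos"]
handled: List[str] = []
count = replay(logged, handled.append, rate_per_sec=100.0)
```

Because the same log and the same rate produce the same sequence of requests every run, results from before and after a tuning change can be compared directly.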

At Foursquare, we push new versions of our application code at master/HEAD to production at least daily. We are constantly adding features, tweaking how old features work, doing A/B tests on experimental features, and doing behind-the-scenes work like refactoring and optimization to boot. So any load test we might create would have to be constantly updated to keep up with new features and new code. This hypothetical situation is reminiscent of bad unit tests that basically repeat the code being tested: duplicated effort for dubious gain.

To make things even worse, a lot of our features rely on a lot of data. For example, to surface insights after you check in to a location on Foursquare we have to consider all your previous check-ins, your friends’ check-ins, popular tips at the venue, nearby venues that are popular right now, etc. etc. Creating an environment in which we might run a meaningful load test would require us to duplicate a lot of data, maybe as much as the whole site. A lot of data means a lot of RAM to serve it from, and RAM is expensive.

So we usually choose not to attempt these “canned” load tests. In lieu of a classic load test, our go-to pre-launch performance test is what we call a “dark test.” A dark test involves generating extra work in the system in response to actual requests from users.

For example, in June 2012, we rolled out a major Foursquare redesign in which we switched the main view of the app from a simple list of recent friend check-ins to an activity stream which included other types of content like tips and likes. Behind the scenes, the activity stream implementation was much more complex than the old check-in list. This was in part because we wanted to support advanced behavior like collapsing (your friend just added 50 tips to her to-do list, we should collapse them all into a single stream item).


Before and after the redesign

Perhaps surprisingly, the biggest driver of additional complexity was the requirement for infinite scroll, which meant we needed to be ready to materialize any range of activity for all users. Since the intention was for the activity stream to be the main view a user sees upon opening the Foursquare app, we knew that the activity stream API endpoint would receive many, many requests as soon as users started to download and use the new version of the app. Above all, we did not want to make a big fuss about this great new feature and then give our users a bad experience by serving errors to them when they tried to use it. Dark testing was a key factor in making the launch a success.

The first version of the dark test was very simple: whenever a Foursquare client makes a request for the recent check-ins list, generate an activity stream response in parallel with the recent check-ins response, then throw the activity stream response away. We then hooked this up to a runtime control in our application which permitted it to be invoked on an arbitrary percentage of requests, so we were able to generate this work for one percent, five percent, 20 percent, etc. of all check-in list requests. By the time we were a few weeks out from redesign launch, we were running this test 24/7 for one-hundred percent of requests, which gave us pretty good confidence that we could launch this feature without overloading our systems.
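The shape of that first dark test can be sketched as follows. This is an illustration under assumptions, not Foursquare's implementation: the `DarkTest` class, its `percent` knob, and the handler signatures are invented here, and for simplicity the sketch waits for the dark work to finish before returning, which a production system would likely avoid.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional


class DarkTest:
    """Run extra 'dark' work for a configurable percentage of real requests."""

    def __init__(self, percent: float, rng: Optional[random.Random] = None):
        self.percent = percent            # runtime control, 0 to 100
        self.rng = rng or random.Random()
        self.dark_runs = 0                # how many requests triggered dark work
        self._pool = ThreadPoolExecutor(max_workers=4)

    def handle(self, request: Any,
               primary: Callable[[Any], Any],
               shadow: Callable[[Any], Any]) -> Any:
        dark_future = None
        if self.rng.random() * 100 < self.percent:
            # Start the new code path in parallel with the real one.
            dark_future = self._pool.submit(shadow, request)
        response = primary(request)
        if dark_future is not None:
            try:
                dark_future.result()      # let the work finish, then discard it
            except Exception:
                pass                      # dark failures never reach the user
            self.dark_runs += 1
        return response


def recent_checkins(user_id: int) -> str:
    return f"check-ins for user {user_id}"


def activity_stream(user_id: int) -> str:
    return f"activity stream for user {user_id}"   # computed, then thrown away


dark = DarkTest(percent=20.0, rng=random.Random(42))
responses = [dark.handle(uid, recent_checkins, activity_stream) for uid in range(50)]
```

Users only ever see the output of `recent_checkins`; raising `percent` step by step moves more of the real traffic's load onto the new code path while the old path keeps serving responses.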

Click here to read the full post.

– Cooper Bethea (@cooperb)

Native app integration like never before: The Foursquare for BlackBerry 10 SDK

Yesterday, BlackBerry announced their first BB10 devices. We here on the BlackBerry team at Foursquare are really excited about the launch and wanted to give our awesome third-party developers something to help them get the most out of Foursquare and BlackBerry 10. With the help of the amazing Invocation Framework (learn more here) we have opened up a few parts of the Foursquare for BlackBerry 10 app to developers so they can enrich their native apps with Foursquare content more easily than ever before. Check out the two examples below and then head over to GitHub to check out our sample app and get started.

Foursquare Single Sign On (SSO)
The first thing we’ve opened up, and a personal favorite, is the ability for your users to connect their Foursquare accounts with the click of a button. Instead of every app having to make its own WebView wrapper around the OAuth flow for obtaining an access token, we’ve built it right into the native Foursquare app for everyone to use. You can now let a user log in to your app through Foursquare in 6 lines of code, and the user never has to leave the context of your app.

It is up to the user whether to approve or deny your app. We will send their action back to you, along with the access token if they decided to link their Foursquare account with your app.


Easy as that! So be sure to include a “connect with Foursquare” option in your app to reduce friction for new users signing up!

Foursquare Place Picker
More and more often, the most engaging content that users can create comes with a location attached to it, whether that’s a picture posted on Instagram or a beer being checked into on Untappd. With the place picker API, you can build this rich content into your app, backed by the more than 50 million places in the Foursquare database. The best part is that, just like the SSO API, you get the native Foursquare UI, network requests and GPS functionality built into your app without having to write any of it. Just use the Invocation Framework to launch it. If you already know what your user is looking for, you can pass in a query to prime the search, and if you have already authenticated a user, just pass in their token for personalized results!

Once a user selects a place, we’ll return the JSON data for that place so you can process it and do whatever you need to do with it!


– Kyle, Foursquare BlackBerry Engineer

The Opportunity Gap: learn about the public schools in your neighborhood with @ProPublica and Foursquare

As our recent hackathon showed, there are tons of ways that developers can use Foursquare to power amazing apps – from ones that shame you into going to the gym, to others that alert you to restaurants with health code violations.

Today, ProPublica, an award-winning investigative journalism site, is relaunching its Opportunity Gap news application, which helps people find and compare statistics about public schools across the nation. With their new Foursquare integration, you can connect your Foursquare account to instantly see statistics for schools you’ve checked in to before. And when you’re out, you can instantly get stats about a school on your phone whenever you check in to one. It’s a great example of how news organizations can use Foursquare to reach their readers with relevant information when they’re out in the real world.


Learn more about The Opportunity Gap and connect your Foursquare account here.

A Foursquare Hackathon for the new year – sign up and build the next amazing hack!

And take two! Although Sandy foiled our hackathon plans in November, we’re back and ready to hack it up with the best and brightest. On January 5, we’re inviting developers and designers in NYC, SF, and everywhere around the world to sign up and build some more amazing hacks using the Foursquare API. We’ll have prizes, swag, and (naturally) global glory for the best Foursquare hacks, no matter where in the world they originate.

Head over to our Meetup page and keep an eye on hackathon.foursquare.com for more details as the event draws nearer. If you’re in SF or NYC, sign up to work from our HQ; otherwise, you can use Meetup to connect with hackers in your area.

(Fun fact: back in 2010, two designer/developer friends got together and entered the first Foursquare hackathon. They built a snazzy little hack that took people’s Foursquare check-in histories and resurfaced them in the form of a daily email that showcased people’s check-ins from exactly one year ago. They dubbed their hack, “4SquareAnd7YearsAgo,” and today you might know it as Timehop.)

Now, go sign up for the Foursquare Hackathon 2013!

MongoDB at Foursquare: From the cloud to bare metal

We’ve been running Mongo as our main storage engine for almost 3 years. For most of that time the Mongo servers and all the rest of our infrastructure were hosted on Amazon’s EC2. We recently migrated the Mongo servers onto our own hardware hosted in a datacenter and now have a hybrid environment with everything else still on EC2. I’ll be talking about why and how we did this at the upcoming MongoSV conference on Tuesday, December 4th, 2012.

The name of the talk is “MongoDB at Foursquare: From the cloud to bare metal.” Come check it out!

For more information about the event, check out the agenda or 10gen’s blog post Get Ready for MongoSV. Register now with the discount code “foursquare20” and receive 20% off.

– Jon Hoffman (@hoffrocket)