Exploring the Foursquare ‘Taste Map’

In order to deliver great personalized local recommendations, Foursquare needs to understand not only which places are the best, but also what makes places all over the world different from each other. Whether it’s a dive bar with great late night dancing or a molecular gastronomy restaurant with an amazing tasting menu, we want to categorize these places and understand the relationship that Foursquare users have with them. That’s why this summer we launched “Tastes,” short descriptive tags for venues to help users personalize their experience and find places that suit them. Tastes can be as simple as a favorite dish like “soup dumplings” or a vibe like “good for dates”.

To better understand what our taste data looks like, I created the “Foursquare Taste Map.” Here we see a visualization of the most popular three thousand English tastes. Each taste is connected with a line to others like it, and they are arranged so that similar tastes are closer together. For more technical folks, this is a spring embedding of the k-nearest neighbor graph of tastes using the cosine similarity metric (plotted in Gephi), where each taste is represented as a high-dimensional vector of venue affinities.

 

Taste Map (thumbnail)

Obviously it’s difficult to capture all of the relationships between these tastes on a single page, but you can still see amazing structure emerge like “wine island” on the far right, or various niches of Asian cuisine in the lower left hand corner, or a variety of different hubs emerge around common dishes like “seafood”, “chicken”, and “pizza.” We are so excited to have the opportunity to work with this unique data set to better understand all of the places in the world, and thought you’d enjoy this visualization.

Asian Cuisine Close-Up

The Foursquare Taste Map
(don’t forget to zoom in and scroll around, and add your favorite tastes to your Foursquare account)

Happy exploring!

Blake (@metablake)

Introducing Pants: a build system for large-scale codebases like Foursquare’s.

Foursquare and Swarm are written predominantly in Scala on the server side. But as we’ve grown, so have the size, complexity and diversity of our codebase:

  • We currently have around 700,000 lines of handwritten code in 6500 .scala files.
  • We generate about 1.9 million lines of Scala code from 1400 .thrift files using Spindle, our homegrown data model code generator.
  • We generate UI code from Closure Templates.
  • We compile CSS using Less.
  • We have a significant amount of Python code for running our deploy toolchain, data processing pipelines and other offline systems.
  • Like all large codebases, we also have little bits of other things here and there: some C, some Java, some Ruby, some Bash scripts, and so on.

Naturally there is a complex web of dependencies between different parts of the codebase. In fact our code dependency graph has about 2500 vertices and tens of thousands of edges.

We needed a build toolchain that would work well with this complexity, and the result is Pants, an open source build system we developed together with engineers from Twitter, Square and elsewhere.

Pants was designed to support large-scale, heterogeneous codebases. It uses fine-grained dependency management to build only what you need, keeping build times from getting unnecessarily long (a must when using Scala, with its slow compilation speeds). Pants also makes it straightforward to plug in custom build tools, such as code generators, and it supports every part of the build lifecycle: codegen, external dependency resolution, compilation, linting, test running, bundling deployable artifacts and more.

You can read more about Pants here, including the etymology of the name. If your codebase is growing beyond your toolchain’s ability to scale, you might want to give Pants a try. And of course we’re always looking for contributors to the project!

What today’s announcement means for developers

Update (August 2014): For the latest on our plans for the API, be sure to read our entire recent update for developers.

Today we announced some big news about the products we work on every day here at Foursquare. Read about it if you haven’t already—it’s some pretty exciting stuff. But what do these changes mean for the thousands of developers that rely on our public API?

If you are a developer, in the short run, this will have no effect on your apps. Your users will continue to be able to connect their Foursquare accounts to your app and the way to access the API remains unchanged. In the long run, look for a few coming changes but even more exciting features being added to the API as Swarm and Foursquare grow and evolve. As always, we plan on giving plenty of advanced notice before we make any major changes to the API.

While we have your attention though, take this opportunity to do some spring housecleaning for your Foursquare app! Make sure you follow @foursquareAPI and that the email address associated with your Foursquare app is up-to-date—we use both for making important announcements. And finally, are you ready for our upcoming versioning changes?

As always, you have these channels for questions and support: email, Twitter, StackOverflow, and (new!) AirPair.

Looking forward to Mongo 2.6: A deep dive into the new write commands

We’ve been longtime Mongo users at Foursquare and each new version has brought enhancements in performance, reliability, and features. Mongo 2.6 has a bunch of new features that we’re excited about, but I’m going to focus on just one which might be easy to gloss over if you’re looking at the release notes.

Write operations will be sent synchronously over the wire allowing for some simplifications to mongo internals that should yield performance benefits.

Background

Up until 2.6 all write operations were sent asynchronously over the wire. They were completely fire and forget. When a write command packet was sent to Mongo over a TCP connection, no bytes were returned by the server. In order for a client to get feedback on success, a “getLastError” command had to be sent separately on the same TCP connection with the developer’s durability preferences. The durability preference is referred to as the WriteConcern and might be something like “replicate to two servers” or “fsync to one server”. All the major mongo client drivers abstracted that async behavior so that it looked synchronous, but there were some real negative consequences.

Because the “getLastError” command could be sent at some future point after a write operation, clients had to have their TCP connections pinned for the duration of their sessions. That meant that there were constraints on connection pooling within the drivers and within the mongoS sharded router. Additionally, both the mongoS router and the mongoD servers had to keep track of the accounting of failed writes in order to match up future requests for feedback. Everything could be much simpler if write operations just blocked the client and returned their results.

With mongo 2.2 we hit some problems related to connection growth on our primary servers and mongodb engineers created a patch to allow for better connection reuse on the mongoS [https://jira.mongodb.org/browse/SERVER-9022]. That was a bit of a hack but it has worked very well for our use case.

New write commands

In 2.6, write operations are now sent using the existing command infrastructure that query, count, getlasterror, and all the operational commands utilize. Unlike the old fire and forget write operations, the new command based write operations send a reply for every request. Each type of write command also supports batching for more efficient network utilization and reliability. In the case of any failures within a batch, the reply will contain error messages for each of the operations that failed.

With 2.6, we hope to see further performance and scalability enhancements in our sharded clusters. The mongoS will do a much better job connection pooling because it will be able to reuse connections since they will no longer need to be pinned to each client connection. That should cause the connection counts to the mongoD servers to drop and allow us to add even more clients in the future without worrying about running into limits.

How it works

The drivers will continue to abstract things in a similar way, but for those interested on how the new write operation commands actually work, read further:

There are just a few different types of packets that can be sent over the wire. All queries and commands are sent using an OP_QUERY packet. OP_INSERT, OP_UPDATE, and OP_DELETE are the way to do write operations prior to 2.6 and the big change is that writes are now sent as commands using the OP_QUERY packet.

Commands and queries are both sent using the same packet over the wire and a simple convention is used to tell the server that the packet is a command: the “fullCollectionName” will be of the form “dbname.$cmd” and the “query” will be a BSON document with the command name and parameters.

For a new 2.6 update command, the document will look something like this:

{
 "update": "collectionName",
 "updates": [
   {
     "q": {...}, // the query
     "u": {...}, // the update
     "multi": false,
     "upsert": false
   }
 ],
 "ordered": true, // should errors stop the batch, true by default
 "writeConcern": {"w": 1}
}

Notice that the “updates” key is actually an array. That’s used to support batching of writes. A single update will simply be a batch of one. The “ordered” flag determines whether an error in one item in the batch will terminate further processing or if it will continue on despite the errors. If the flag is set to true, each operation will be executed sequentially, otherwise mongo can split the batch up and execute things out of order in the case where the operation affects multiple shards of a cluster.

As I mentioned above, the command is synchronous and the reply might look like this in the case of an upsert:

{
 "ok" : 1,
 "nModified" : 0,
 "n" : 1,
 "upserted" : [{
   "index" : 0, // matches the index of the update batch
   "_id" : ObjectId("532f9406fec9bb9b1bee9290")
 }]
}

If there was an error with one or more of the operations, the response might look like this in the case of an insert:

{
 "ok" : 1,
 "n" : 0,
 "writeErrors" : [{
   "index" : 0, // matches the index of the insert batch
   "code" : 11000,
   "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.c.$_id_ dup key: { : ObjectId('532f941bfec9bb9b1bee9291') }"
 }]
}

If I had batched multiple inserts and the first failed but I had “ordered” set to false, the rest would be attempted. Otherwise the command would stop processing.

I hope this has been interesting for you. Keep in mind that the performance benefits that we’re excited about are completely theoretical until tested out in production. We aren’t using 2.6 in production yet, but plan on rolling it out slowly in the coming months. If you’re going to try 2.6 for yourself, you’ll need to upgrade your client driver to take advantage of the new command infrastructure.

Jon Hoffman
Foursquare Engineer

Good Tech Lead, Bad Tech Lead

A brief guide to tech leadership at Foursquare, inspired by Ben Horowitz’s Good Product Manager, Bad Product Manager.

Teamwork

Good tech leads act as a member of the team, and consider themselves successful when the team is successful. They take the unsexy grungy work and clear roadblocks so their team can operate at 100%. They work to broaden the technical capabilities of their team, making sure knowledge of critical systems is not concentrated in one or two minds.

Bad tech leads take the high-profile tasks for themselves and are motivated by being able to take credit for doing the work. They optimize locally, keeping team members working on projects that benefit the team at the expense of the engineering organization at large.

Technical vision

Good tech leads have an overall vision for the technical direction of the product and make sure the team understands it. They delegate feature areas to other team members and let them own their decisions. They recognize that their team members are smart, trust them, and rely on them to handle significant pieces of the project.

Bad tech leads resist explaining or clarifying the technical direction and dictate decisions instead. They keep critical institutional knowledge in their heads, failing to multiply their effectiveness by creating and disseminating helpful documentation.

Discussion and debate

Good tech leads listen and encourage debate. When the team is unable to resolve a debate, they describe a process or framework of thinking that would help them resolve it. They don’t enter discussions with foregone conclusions, and always allow themselves to be persuaded by great ideas.

Bad tech leads allow debates to go on for too long without resolution, hampering the productivity of the team. Others cut off debate prematurely, dismissing new discussions by saying the matter is “already settled.” Bad tech leads believe it is more important that they win the argument than that the team reaches the right decision.

Amazing comic courtesy @blackmad

Project management

Good tech leads are proactive. They make sure technical progress is on track. They work with team members to come up with estimates and to establish intermediate milestones. They anticipate areas of concern and make sure they are addressed before they become a problem. They identify technical roadblocks and help the team get around them. They identify areas of overlap where work can be shared, and conversely, find areas that are not getting enough attention and direct resources toward it.

Bad tech leads are reactive. They may delegate, but do not follow up to make sure progress is being made. They don’t set intermediate goals and hope that everything just comes together in the end. They wait until just before launch to do end-to-end tests of complex systems. They allow team members to waste time on interesting but unimportant work.

Pragmatism

Good tech leads are pragmatic and find a balance between doing it right and getting it done. They cut corners when it’s expedient but never out of laziness. They encourage their team to find temporary shortcuts or workarounds to problems that are blocking overall progress, and to build minimum viable infrastructure for launch. To good tech leads, details matter. Code quality, code reviews, and testing are just as important as shipping on time.

Bad tech leads take shortcuts that save time in the short term but cost more in the long term, and let technical debt pile up. They cannot distinguish between situations that call for expediency and those that call for perfection.

Communication

Good tech leads know that their role is much more than writing code, that effective communication is a vital part of their job, and that time spent making their team more efficient is time well spent. They acknowledge that some communication overhead is necessary when working on a team, and they sacrifice some personal productivity for overall team productivity.

Bad tech leads believe that they are most productive when they are writing code, and think communication is a distraction. They do not optimize for overall team productivity, but rather for what works best for themselves. They get frustrated when they have to take time to lead.

Relationship with Product

Good tech leads are in a conversation with product managers and designers about how the product should work. They are not afraid to push back on decisions they disagree with, but keep the product goals in mind and know when to accommodate them. They find creative workarounds to technical constraints by suggesting alternative product formulations that are less technically demanding, and help PMs and designers understand technical challenges so that they make informed trade-offs themselves.

Bad tech leads throw product decisions “over the wall” and do not take ownership of the product. They push back due to technical constraints but do not offer alternatives or explanations.

Resiliency

Good tech leads are resilient to changes to the product specification and react calmly to surprises. They anticipate where changes might take place and design their code to handle them.

Bad tech leads are upset when the specification changes, or prematurely generalize their design in areas where changes are unlikely to occur.

Personality

Good tech leads are easy-going but assertive. Bad tech leads are confrontational and aggressive. Good tech leads emerge naturally and earn respect through technical competence and experience. Bad tech leads think their title confers respect and authority. Good tech leads are always looking for ways to improve.

Bad tech leads get defensive when given feedback. Good tech leads are humble and boost the confidence of everyone else on the team. Bad tech leads are arrogant and take pleasure in making their teammates feel inferior.

-Jason Liszka (originally published on Medium)


Foursquare is hiring! http://foursquare.com/jobs

Mongo on Hadoop

At Foursquare, one of our most important pieces of data infrastructure is getting a copy of our production Mongo database into Hadoop. Today, we’re open-sourcing two parts of this job, a utility to dump Mongo to Hadoop, and code to read this data from MapReduce jobs.

Why Mongo on Hadoop?

Our Mongo database is the source of truth for everything at Foursquare. How many users do we have? How many people are checked into JFK right now? What’s the most-liked tip left by a friend at this bar? Ask Mongo. However, the database has to stay extremely responsive to reads and writes at all times; no one wants their check ins to show up ten minutes after they happen! This means that more expensive queries (such as looking at every user joined with every venue they’ve visited) cannot happen on Mongo. So how do we answer these questions? We dump the data to Hadoop!

What is Hadoop?

Hadoop is an open-source computation framework for massive sets of data. It has two main components: HDFS (Hadoop Distributed File System) for storing data, and MapReduce for running computations on that data. Foursquare’s current Hadoop cluster is about 100 servers in our datacenter, with 2.5 Petabytes of raw storage. Hadoop powers a lot of what we do here, from business and product decisions, to some of our coolest features (like the real-time notifications from the all-new Foursquare).

Having our Mongo data in Hadoop allows us to have the exact same view of the world that the app has. It lets us ask massive questions without any impact on the production database.

Getting The Data There

It starts with the Mongo replicas, which are also running in our datacenter. These nodes are all running independent LVM stacks beneath each mongod process (one LVM stack per physical SSD volume). Every six hours a process running on each node issues an LVM snapshot command for the disk on which a given mongo database runs. A central coordinator ensures that this happens as close as possible to simultaneously across all clusters. This creates a “sane” snapshot of the cluster. If shards were snapshotted at very different times, there could be records with foreign keys that don’t exist. These snapshots are archived and compressed locally, then uploaded to a specific directory named according to the snapshot group being taken in HDFS.

A separate process is continuously monitoring this directory, waiting for a complete cluster to be available (i.e., every shard from the cluster exists). Once that happens, the files are downloaded, decompressed, and extracted in parallel across several servers. When an individual file finishes downloading, we launch the mongodump utility to write the data back to HDFS. This data is in a format that’s easier to consume in MapReduce jobs. For example, all our checkin data up to and including 2014-01-01 is in a single directory, stored in Mongo’s BSON format: /datasets/mongo/bson/ident=checkins/version=2014-01-01/col=checkins/

Reading the Data

Having our entire Mongo database available in Hadoop is nice, but it’s not very useful unless it can be read by MapReduce jobs. Every Mongo collection that gets read in Hadoop has an associated Thrift definition, and a custom input format turns the BSON into the Thrift object. Since we use Scala, the Thrift objects are generated Scala code, using Spindle.

When an entire cluster is finished for a particular day, we put a marker file to indicate this to Hadoop jobs. Another process updates our Hive tables to point to the new data, entirely transparent to people writing Hive queries.

– Joe Ennever (@TDJoe), Data Infrastructure Engineer

The Mathematics of Gamification

At Foursquare, we maintain a database of 60 million venues. And like the world it represents, our database is ever-changing, with users from all over the world submitting updates on everything from the hours of a restaurant to the address of a new barbershop. To maintain the accuracy of our venue database, these changes are voted upon by our loyal Superusers (SUs) who vigilantly maintain a watchful eye over our data for their city or neighborhood.

Like many existing crowd-sourced datasets (Quora, Stack Overflow, Amazon Reviews), we assign users points or votes based on their tenure, reputation, and the actions they take. Superusers like points and gamification. It rewards diligent, hard-working SUs (which are the majority) and punishes the few malicious “bad players.” But data scientists like probabilities and guarantees. We’re interested in making statements like, “we are 99% confident that each entry is correct.” How do we allocate points to users in a way that rewards them for behavior but allows us to make guarantees about the accuracy of our database?

At Foursquare, we have a simple, first-principles based method of resolving proposed venue attribute updates. We can gauge each Superuser’s voting accuracy based on their performance on honeypots (proposed updates with known answers which are deliberately inserted into the updates queue). Measuring performance and using these probabilities correctly is the key to how we assign points to a Superuser’s vote.

The Math

Let’s make this more concrete with some math. Let $H_0$ denote the true state of the world, either $1$ or $-1$, which we can interpret as a proposed update being true or false, respectively. We do not observe this but we know $H_0 = 1$ with a-priori probability $p_0$. User 1 votes $H_1$ (again, either $1$ or $-1$, representing “yay” or “nay”) with independent probability $p_1$ of agreeing with the truth $H_0$ and $(1-p_1)$ of disagreeing. Bayes’ Rule then gives us
\begin{align*}
\P(H_0 = 1 | H_1 = 1) & = \frac{\P(H_1 = 1 | H_0 = 1) \P(H_0 = 1)}{\P(H_1 = 1)} \\
& = \frac{p_0 p_1}{p_0 p_1 + (1-p_0)(1-p_1)} \\
& = \frac{\ell_0 \ell_1}{\ell_0 \ell_1 + 1}
\end{align*}where we have written the solution in terms of the likelihood ratio $\ell_k = \ell(p_k)$ given by
\[ \ell(p) = \frac{p}{1-p} \qquad \ell^{-1}(\cdot) = \frac{\cdot}{1 + \cdot}\,. \]Then we have that
\[ \P(H_0 = 1 | H_1 = 1) = \ell^{-1}(\ell(p_0) \ell(p_1))\,. \]In fact, it is easy to see that in the general case,
\[ \P(H_0 = 1 | H_1 = h_1) = \ell^{-1}(\ell(p_0) \ell(p_1)^{h_1})\,. \]Multiplication is hard so we will define the logit or log-likelihood function
\[ \logit: (0,1) \to (-\infty, \infty) \]given by
\[ \logit(p) = \log(\ell(p)) = \log\prn{\frac{p}{1-p}} \qquad \logit^{-1}(\cdot) = \frac{e^{\cdot}}{1 + e^{\cdot}}\,. \]Then we have
\[ \P(H_0 = 1| H_1 = h_1) = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1))\,. \]

Continuing, assume that after user 1 casts their vote, user 2 votes $H_2$ with an independent probability $p_2$ of being correct (i.e. agreeing with $H_0$). We can think of the posterior probability $\P(H_0 = 1| H_1 = h_1)$ as our new prior and inductively repeat the above Bayesian analysis to obtain
\begin{align*}
\P(H_0 = 1 | H_1 = h_1, H_2 = h_2) & = \logit^{-1}\prn{\logit\prn{\P(H_0 = 1 | H_1 = h_1)} + h_2 \logit(p_2)} \\
& = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1) + h_2 \logit(p_2))\,.
\end{align*}In fact, if we have $n$ votes $H_1, \ldots, H_n$, then we have
\begin{align} & \P(H_0 = 1| H_1 = h_1, \ldots, H_n = h_n) \nonumber\\
& \qquad\qquad = \logit^{-1}\prn{ \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) } \,. \label{eq:main}
\end{align}

The Solution

The above equation suggests that we should assign $s_k$ points or votes to user $k$ based on \begin{equation} s_k = \logit(p_k)\,. \label{eq:points} \end{equation} We can add up all the “yay” votes and subtract all the “nay” votes to obtain a score for the update. This score can easily be interpreted as a probability that the update is correct. We can set a certainty threshold $p$ (e.g. $p = 99\%$) as a threshold for a desired accuracy of this edit. Then, we accept a proposed edit as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \ge \logit(p) \label{eq:upper} \end{equation} and reject it as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \le – \logit(p)\,. \label{eq:lower} \end{equation}

In other words, if we take $t = \logit(p)$ to the the points threshold and $s_0 = \logit(p_0)$ to be the points allocated to a new proposed edit, then \eqref{eq:upper} and \eqref{eq:lower} become
\[ s_0 + \sum_{k=1}^n h_k s_k \ge t \]and
\[ s_0 + \sum_{k=1}^n h_k s_k \le - t\,, \]which are exactly the equations for voting you would expect. But now, they’re derived from math!

The Benefits

  • Efficient, data-driven guarantees about database accuracy. By choosing the points based on a user’s accuracy, we can intelligently accrue certainty about a proposed update and stop the voting process as soon as the math guarantees the required certainty.
  • Still using points, just smart about calculating them. By relating a user’s accuracy and the certainty threshold needed to accept a proposed update to an additive point system \eqref{eq:points}, we can still give a user the points that they like. This also makes it easy to take a system of ad-hoc points and convert it over to a smarter system based on empirical evidence.
  • Scalable and easily extensible. The parameters are automatically trained and can adapt to changes in the behavior of the userbase. No more long meetings debating how many points to grant to a narrow use case.
    So far, we’ve taken a very user-centric view of $p_k$ (this is the accuracy of user $k$). But we can go well beyond that. For example, $p_k$ could be “the accuracy of user $k$’s vote given that they have been to the venue three times before and work nearby.” These clauses can be arbitrarily complicated and estimated from a (logistic) regression of the honeypot performance. The point is that these changes will be based on data and not subjective judgments of how many “points” a user or situation should get.

Some practical considerations:

  • In practice, we might want a different threshold for accepting \eqref{eq:upper} versus rejecting \eqref{eq:lower} a proposed edit.
  • For notational simplicity, we have assumed that a false positives and false negatives in user $k$’s voting accuracy have the same probability $p_k$. In general, this is not the case. We leave it to the reader to figure the math of the general case.
  • Users like integer points. We have to round $s_k$ to the nearest integer. Because we can multiply linear equations like \eqref{eq:upper} and \eqref{eq:lower} by a positive constant, we can set $s_k = [\alpha \cdot \logit(p_k)]$ where $[\cdot]$ is the rounding function and $\alpha$ is a large positive constant. A large $\alpha$ will prevent the loss of fidelity.
  • We’ve explained how to obtain $p_1, p_2, \ldots$ from honeypots but how do we obtain $p_0$, the accuracy of newly proposed updates. One way is to use the above to bootstrap those accuracies from voting: we can use this voting technique to infer the accuracy of proposals by looking at what fraction of proposed updates are accepted!
  • Bayesian Smoothing. We assume a relatively low-accuracy prior for the accuracy of individuals. This is a pessimistic assumption that keeps new, untested users from having too much influence. It also rewards users for lending their judgment and casting votes as long as those are more accurate than our pessimistic prior. Of course, we also increase the likelihood of showing new Super Users honeypots to give them a chance to prove themselves.

–Michael Li
Data Scientist
@tianhuil

Foursquare’s new notifications and the future of contextual mobile experiences

For the last year I’ve been obsessed with a new breed of mobile applications that are aware of a user’s context: who they are, where they are in the world, and what is going on around them.  Apps like Dark SkyGoogle Now, and Square Wallet are starting to enable amazing new real-world experiences that make users feel like they have superpowers by connecting them seamlessly to information.

Last month, we launched the new Foursquare notifications which automatically lets people know about the best dishes on the menu when they walk into a restaurant, or the top spots not to miss when they land in a new city.  In this talk at Data Driven NYC, I explain how we built this exciting new product from the data exhaust of millions of mobile devices, and how it sets the groundwork for an exciting new world of highly-targeted contextual experiences.

Data Driven NYC 20 // Blake Shaw of Foursquare from Matt Turck on Vimeo.

- Blake (@metablake)

A chat about data science and our fun visualizations

A little while back, I gave a talk on a Big Data Panel at the Stanford Graduate School of Business’s China 2.0 conference.  We had a great discussion about the uses of data science and the fun visualizations we do with our data at Foursquare. Check it out: 

–Michael
@tianhuil

How we built our Model Training Engine

At Foursquare, we have large-scale machine-learning problems. From choosing which venue a user is trying to check in at based on a noisy GPS signal, to serving personalized recommendations, discounts, and promoted updates to users based on where they or their friends have been, almost every aspect of the app uses machine-learning in some way.  All of these queries happen at a massive scale: we average one million Explore queries and six million check-ins every day. Not only do we have to process each request faster than the blink of an eye, but these millions of user interactions are giving us millions of data points to feed back into our models to make them better. We’ve been building out a Model Training Engine (MTE) to automate our (machine) learning from user data.  Here’s an overview to whet your appetite.

Fitting the model to the data rather than the data to the model.

Many models are built using linear regressions or similar approaches. While these models can help us quickly understand data (and we certainly make use of them), they make convenient but unrealistic assumptions and are limited in the kinds of relationships they can express. The MTE uses techniques liked Boosted Decision Trees or Random Forests (we have both a scikit-learn and an in-house MapReduce based implementation) to learn much more detailed and nuanced models that fit the data better.

Keeping models fresh and relevant.

With 6 million new check-ins a day, models quickly get stale. The MTE automatically retrains models daily based on the latest signals and the latest data. New signals and changes in old signals are immediately incorporated into new models and we monitor and deploy newer models when they outperform older ones.

Model training that scales with data and the organization.

With a large-scale, very interconnected system, changes made by other engineers on a seemingly unrelated app feature could throw off a very carefully calibrated model. How do we make model building scale across billions of check-ins and an entire organization without engineers stepping on each other’s toes?

To make models scalable across our data, we’ve rolled our own online learning algorithms and use clever downsampling techniques when we cannot load the entire dataset into memory. We use techniques like bagging and cross-validation to optimally understand how to combine different signals into a single prediction in a way that maximizes the contribution from each signal without picking up on spurious correlations (aka overfitting). This means that no one can throw off the model by adding or tweaking a signal. For example, If an engineer accidentally adds random noise (e.g. dice rolls) as a signal, the MTE would quickly detect that signal was not predictive and ignore it. This allows us to be open to new ideas and signals from pretty much anyone at the company, not just data scientists.

What’s more, the MTE can adapt to frequent UX and other product changes, all without human intervention. For example, if our mobile team changes the UI to make friends’ prior visits more prominent, the MTE will automatically detect that users are weighing social signals more heavily and adjust our models accordingly. And our automated Model Training Engine means that engineers can concentrate on building signals and let the model training select their best ones.

All of these quality improvements are translating into a better and smarter user experience. More details (with code) and quality improvements to come!

–Michael Li, Data Scientist
@tianhuil