Good Tech Lead, Bad Tech Lead

A brief guide to tech leadership at Foursquare, inspired by Ben Horowitz’s Good Product Manager, Bad Product Manager.


Good tech leads act as a member of the team, and consider themselves successful when the team is successful. They take the unsexy grungy work and clear roadblocks so their team can operate at 100%. They work to broaden the technical capabilities of their team, making sure knowledge of critical systems is not concentrated in one or two minds.

Bad tech leads take the high-profile tasks for themselves and are motivated by being able to take credit for doing the work. They optimize locally, keeping team members working on projects that benefit the team at the expense of the engineering organization at large.

Technical vision

Good tech leads have an overall vision for the technical direction of the product and make sure the team understands it. They delegate feature areas to other team members and let them own their decisions. They recognize that their team members are smart, trust them, and rely on them to handle significant pieces of the project.

Bad tech leads resist explaining or clarifying the technical direction and dictate decisions instead. They keep critical institutional knowledge in their heads, failing to multiply their effectiveness by creating and disseminating helpful documentation.

Discussion and debate

Good tech leads listen and encourage debate. When the team is unable to resolve a debate, they describe a process or framework of thinking that would help them resolve it. They don’t enter discussions with foregone conclusions, and always allow themselves to be persuaded by great ideas.

Bad tech leads allow debates to go on for too long without resolution, hampering the productivity of the team. Others cut off debate prematurely, dismissing new discussions by saying the matter is “already settled.” Bad tech leads believe it is more important that they win the argument than that the team reaches the right decision.

Amazing comic courtesy @blackmad

Project management

Good tech leads are proactive. They make sure technical progress is on track. They work with team members to come up with estimates and to establish intermediate milestones. They anticipate areas of concern and make sure they are addressed before they become a problem. They identify technical roadblocks and help the team get around them. They identify areas of overlap where work can be shared, and conversely, find areas that are not getting enough attention and direct resources toward them.

Bad tech leads are reactive. They may delegate, but do not follow up to make sure progress is being made. They don’t set intermediate goals and hope that everything just comes together in the end. They wait until just before launch to do end-to-end tests of complex systems. They allow team members to waste time on interesting but unimportant work.


Good tech leads are pragmatic and find a balance between doing it right and getting it done. They cut corners when it’s expedient but never out of laziness. They encourage their team to find temporary shortcuts or workarounds to problems that are blocking overall progress, and to build minimum viable infrastructure for launch. To good tech leads, details matter. Code quality, code reviews, and testing are just as important as shipping on time.

Bad tech leads take shortcuts that save time in the short term but cost more in the long term, and let technical debt pile up. They cannot distinguish between situations that call for expediency and those that call for perfection.


Good tech leads know that their role is much more than writing code, that effective communication is a vital part of their job, and that time spent making their team more efficient is time well spent. They acknowledge that some communication overhead is necessary when working on a team, and they sacrifice some personal productivity for overall team productivity.

Bad tech leads believe that they are most productive when they are writing code, and think communication is a distraction. They do not optimize for overall team productivity, but rather for what works best for themselves. They get frustrated when they have to take time to lead.

Relationship with Product

Good tech leads are in a conversation with product managers and designers about how the product should work. They are not afraid to push back on decisions they disagree with, but keep the product goals in mind and know when to accommodate them. They find creative workarounds to technical constraints by suggesting alternative product formulations that are less technically demanding, and help PMs and designers understand technical challenges so that they make informed trade-offs themselves.

Bad tech leads throw product decisions “over the wall” and do not take ownership of the product. They push back due to technical constraints but do not offer alternatives or explanations.


Good tech leads are resilient to changes to the product specification and react calmly to surprises. They anticipate where changes might take place and design their code to handle them.

Bad tech leads are upset when the specification changes, or prematurely generalize their design in areas where changes are unlikely to occur.


Good tech leads are easy-going but assertive. Bad tech leads are confrontational and aggressive. Good tech leads emerge naturally and earn respect through technical competence and experience. Bad tech leads think their title confers respect and authority. Good tech leads are always looking for ways to improve.

Bad tech leads get defensive when given feedback. Good tech leads are humble and boost the confidence of everyone else on the team. Bad tech leads are arrogant and take pleasure in making their teammates feel inferior.

Jason Liszka (originally published on Medium)

Mongo on Hadoop

At Foursquare, one of our most important pieces of data infrastructure is getting a copy of our production Mongo database into Hadoop. Today, we’re open-sourcing two parts of this job: a utility to dump Mongo to Hadoop, and code to read this data from MapReduce jobs.

Why Mongo on Hadoop?

Our Mongo database is the source of truth for everything at Foursquare. How many users do we have? How many people are checked into JFK right now? What’s the most-liked tip left by a friend at this bar? Ask Mongo. However, the database has to stay extremely responsive to reads and writes at all times; no one wants their check-ins to show up ten minutes after they happen! This means that more expensive queries (such as looking at every user joined with every venue they’ve visited) cannot happen on Mongo. So how do we answer these questions? We dump the data to Hadoop!

What is Hadoop?

Hadoop is an open-source computation framework for massive sets of data. It has two main components: HDFS (Hadoop Distributed File System) for storing data, and MapReduce for running computations on that data. Foursquare’s current Hadoop cluster is about 100 servers in our datacenter, with 2.5 Petabytes of raw storage. Hadoop powers a lot of what we do here, from business and product decisions, to some of our coolest features (like the real-time notifications from the all-new Foursquare).

Having our Mongo data in Hadoop allows us to have the exact same view of the world that the app has. It lets us ask massive questions without any impact on the production database.

Getting The Data There

It starts with the Mongo replicas, which are also running in our datacenter. These nodes are all running independent LVM stacks beneath each mongod process (one LVM stack per physical SSD volume). Every six hours a process running on each node issues an LVM snapshot command for the disk on which a given mongo database runs. A central coordinator ensures that this happens as close as possible to simultaneously across all clusters. This creates a “sane” snapshot of the cluster. If shards were snapshotted at very different times, there could be records with foreign keys that don’t exist. These snapshots are archived and compressed locally, then uploaded to HDFS, into a directory named after the snapshot group being taken.

A separate process is continuously monitoring this directory, waiting for a complete cluster to be available (i.e., every shard from the cluster exists). Once that happens, the files are downloaded, decompressed, and extracted in parallel across several servers. When an individual file finishes downloading, we launch the mongodump utility to write the data back to HDFS. This data is in a format that’s easier to consume in MapReduce jobs. For example, all our checkin data up to and including 2014-01-01 is in a single directory, stored in Mongo’s BSON format: /datasets/mongo/bson/ident=checkins/version=2014-01-01/col=checkins/
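
The completeness check and the directory layout described above can be sketched roughly as follows. This is an illustration only, not the open-sourced utility: the function names and shard names are hypothetical, and the idea that every path segment under the root is a `key=value` pair is an assumption read off the single example path in the text.

```python
from typing import Dict, Iterable, Set


def missing_shards(expected: Iterable[str], uploaded: Iterable[str]) -> Set[str]:
    """Return the shards whose snapshot has not been uploaded yet."""
    return set(expected) - set(uploaded)


def parse_dataset_path(path: str, root: str = "/datasets/mongo/bson/") -> Dict[str, str]:
    """Split a dataset path into its key=value components.

    The layout (ident=..., version=..., col=...) is assumed from the
    example path in the post.
    """
    if not path.startswith(root):
        raise ValueError(f"{path!r} is not under {root!r}")
    parts = [p for p in path[len(root):].split("/") if p]
    return dict(part.split("=", 1) for part in parts)


# Hypothetical shard names: a snapshot group is usable only once nothing is missing.
expected_shards = {"shard0", "shard1", "shard2"}
print(missing_shards(expected_shards, {"shard0", "shard2"}))
print(parse_dataset_path(
    "/datasets/mongo/bson/ident=checkins/version=2014-01-01/col=checkins/"
))
```

A monitor built this way would start downloading and extracting only when `missing_shards` returns an empty set for a snapshot group.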

Reading the Data

Having our entire Mongo database available in Hadoop is nice, but it’s not very useful unless it can be read by MapReduce jobs. Every Mongo collection that gets read in Hadoop has an associated Thrift definition, and a custom input format turns the BSON into the Thrift object. Since we use Scala, the Thrift objects are generated Scala code, using Spindle.

When an entire cluster is finished for a particular day, we put a marker file to indicate this to Hadoop jobs. Another process updates our Hive tables to point to the new data, entirely transparent to people writing Hive queries.

– Joe Ennever (@TDJoe), Data Infrastructure Engineer

The Mathematics of Gamification

At Foursquare, we maintain a database of 60 million venues. And like the world it represents, our database is ever-changing, with users from all over the world submitting updates on everything from the hours of a restaurant to the address of a new barbershop. To maintain the accuracy of our venue database, these changes are voted upon by our loyal Superusers (SUs) who vigilantly maintain a watchful eye over our data for their city or neighborhood.

Like many existing crowd-sourced datasets (Quora, Stack Overflow, Amazon Reviews), we assign users points or votes based on their tenure, reputation, and the actions they take. Superusers like points and gamification. It rewards diligent, hard-working SUs (which are the majority) and punishes the few malicious “bad players.” But data scientists like probabilities and guarantees. We’re interested in making statements like, “we are 99% confident that each entry is correct.” How do we allocate points to users in a way that rewards them for behavior but allows us to make guarantees about the accuracy of our database?

At Foursquare, we have a simple, first-principles based method of resolving proposed venue attribute updates. We can gauge each Superuser’s voting accuracy based on their performance on honeypots (proposed updates with known answers which are deliberately inserted into the updates queue). Measuring performance and using these probabilities correctly is the key to how we assign points to a Superuser’s vote.

The Math

Let’s make this more concrete with some math. Let $H_0$ denote the true state of the world, either $1$ or $-1$, which we can interpret as a proposed update being true or false, respectively. We do not observe this but we know $H_0 = 1$ with a-priori probability $p_0$. User 1 votes $H_1$ (again, either $1$ or $-1$, representing “yay” or “nay”) with independent probability $p_1$ of agreeing with the truth $H_0$ and $(1-p_1)$ of disagreeing. Bayes’ Rule then gives us
\begin{align*}
\P(H_0 = 1 | H_1 = 1) & = \frac{\P(H_1 = 1 | H_0 = 1) \P(H_0 = 1)}{\P(H_1 = 1)} \\
& = \frac{p_0 p_1}{p_0 p_1 + (1-p_0)(1-p_1)} \\
& = \frac{\ell_0 \ell_1}{\ell_0 \ell_1 + 1}
\end{align*}
where we have written the solution in terms of the likelihood ratio $\ell_k = \ell(p_k)$ given by
\[ \ell(p) = \frac{p}{1-p} \qquad \ell^{-1}(\cdot) = \frac{\cdot}{1 + \cdot}\,. \]
Then we have that
\[ \P(H_0 = 1 | H_1 = 1) = \ell^{-1}(\ell(p_0) \ell(p_1))\,. \]
In fact, it is easy to see that in the general case,
\[ \P(H_0 = 1 | H_1 = h_1) = \ell^{-1}(\ell(p_0) \ell(p_1)^{h_1})\,. \]
Multiplication is hard so we will define the logit or log-likelihood function
\[ \logit: (0,1) \to (-\infty, \infty) \]
given by
\[ \logit(p) = \log(\ell(p)) = \log\prn{\frac{p}{1-p}} \qquad \logit^{-1}(\cdot) = \frac{e^{\cdot}}{1 + e^{\cdot}}\,. \]
Then we have
\[ \P(H_0 = 1| H_1 = h_1) = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1))\,. \]
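
As a quick sanity check of these formulas, take the illustrative values $p_0 = 0.6$ and $p_1 = 0.9$ (chosen for this example, not taken from real data). For a “yay” vote, $h_1 = 1$:
\[ \ell(p_0) = \frac{0.6}{0.4} = 1.5\,, \qquad \ell(p_1) = \frac{0.9}{0.1} = 9\,, \qquad \P(H_0 = 1 | H_1 = 1) = \frac{1.5 \cdot 9}{1.5 \cdot 9 + 1} = \frac{13.5}{14.5} \approx 0.931\,, \]
which agrees with the direct Bayes computation $\frac{0.6 \cdot 0.9}{0.6 \cdot 0.9 + 0.4 \cdot 0.1} = \frac{0.54}{0.58} \approx 0.931$. For a “nay” vote, $h_1 = -1$:
\[ \P(H_0 = 1 | H_1 = -1) = \ell^{-1}\prn{\frac{1.5}{9}} = \frac{1/6}{1 + 1/6} = \frac{1}{7} \approx 0.143\,. \]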

Continuing, assume that after user 1 casts their vote, user 2 votes $H_2$ with an independent probability $p_2$ of being correct (i.e. agreeing with $H_0$). We can think of the posterior probability $\P(H_0 = 1| H_1 = h_1)$ as our new prior and inductively repeat the above Bayesian analysis to obtain
\begin{align*}
\P(H_0 = 1 | H_1 = h_1, H_2 = h_2) & = \logit^{-1}\prn{\logit\prn{\P(H_0 = 1 | H_1 = h_1)} + h_2 \logit(p_2)} \\
& = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1) + h_2 \logit(p_2))\,.
\end{align*}
In fact, if we have $n$ votes $H_1, \ldots, H_n$, then we have
\begin{align} & \P(H_0 = 1| H_1 = h_1, \ldots, H_n = h_n) \nonumber\\
& \qquad\qquad = \logit^{-1}\prn{ \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) } \,. \label{eq:main}
\end{align}

The Solution

The above equation suggests that we should assign $s_k$ points or votes to user $k$ based on \begin{equation} s_k = \logit(p_k)\,. \label{eq:points} \end{equation} We can add up all the “yay” votes and subtract all the “nay” votes to obtain a score for the update. This score is easily converted into a probability that the update is correct. We can choose a certainty threshold $p$ (e.g. $p = 99\%$) for the desired accuracy of an edit. Then, we accept a proposed edit as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \ge \logit(p) \label{eq:upper} \end{equation} and reject it as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \le -\logit(p)\,. \label{eq:lower} \end{equation}

In other words, if we take $t = \logit(p)$ to be the points threshold and $s_0 = \logit(p_0)$ to be the points allocated to a new proposed edit, then \eqref{eq:upper} and \eqref{eq:lower} become
\[ s_0 + \sum_{k=1}^n h_k s_k \ge t \]
and
\[ s_0 + \sum_{k=1}^n h_k s_k \le -t\,, \]
which are exactly the equations for voting you would expect. But now, they’re derived from math!
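
The voting rule can be sketched in a few lines of Python. This is a minimal illustration of the formulas above, not production code: the function names and the example accuracies are made up, and it assumes a threshold $p > 0.5$ so that $t > 0$.

```python
import math
from typing import Iterable, Tuple


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse of logit: maps a points score back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def resolve(p0: float, votes: Iterable[Tuple[int, float]],
            p: float = 0.99) -> Tuple[str, float]:
    """Consume (h_k, p_k) votes in order and stop once a threshold is crossed.

    h_k is +1 ("yay") or -1 ("nay"); p_k is that voter's accuracy.
    Returns the decision and the posterior probability that the update is true.
    """
    t = logit(p)        # points threshold
    score = logit(p0)   # s_0, the points carried by the proposal itself
    remaining = iter(votes)
    while -t < score < t:
        try:
            h, p_k = next(remaining)
        except StopIteration:
            return "pending", inv_logit(score)
        score += h * logit(p_k)   # add or subtract s_k
    return ("accept" if score >= t else "reject"), inv_logit(score)


# With a 50/50 prior, three 90%-accurate "yay" votes cross the 99% threshold,
# while two of them (posterior 81/82, about 0.988) do not.
print(resolve(0.5, [(1, 0.9), (1, 0.9), (1, 0.9)]))
print(resolve(0.5, [(1, 0.9), (1, 0.9)]))
```

Because the score is a running sum, voting can stop at the first vote that crosses either threshold, and later votes never need to be collected.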

The Benefits

  • Efficient, data-driven guarantees about database accuracy. By choosing the points based on a user’s accuracy, we can intelligently accrue certainty about a proposed update and stop the voting process as soon as the math guarantees the required certainty.
  • Still using points, just smart about calculating them. By relating a user’s accuracy and the certainty threshold needed to accept a proposed update to an additive point system \eqref{eq:points}, we can still give a user the points that they like. This also makes it easy to take a system of ad-hoc points and convert it over to a smarter system based on empirical evidence.
  • Scalable and easily extensible. The parameters are automatically trained and can adapt to changes in the behavior of the userbase. No more long meetings debating how many points to grant to a narrow use case.
    So far, we’ve taken a very user-centric view of $p_k$ (this is the accuracy of user $k$). But we can go well beyond that. For example, $p_k$ could be “the accuracy of user $k$’s vote given that they have been to the venue three times before and work nearby.” These clauses can be arbitrarily complicated and estimated from a (logistic) regression of the honeypot performance. The point is that these changes will be based on data and not subjective judgments of how many “points” a user or situation should get.

Some practical considerations:

  • In practice, we might want a different threshold for accepting \eqref{eq:upper} versus rejecting \eqref{eq:lower} a proposed edit.
  • For notational simplicity, we have assumed that user $k$’s false-positive and false-negative rates are equal, so that a single accuracy $p_k$ describes both cases. In general, this is not the case. We leave it to the reader to work out the math of the general case.
  • Users like integer points. We have to round $s_k$ to the nearest integer. Because we can multiply linear equations like \eqref{eq:upper} and \eqref{eq:lower} by a positive constant, we can set $s_k = [\alpha \cdot \logit(p_k)]$ where $[\cdot]$ is the rounding function and $\alpha$ is a large positive constant. A large $\alpha$ will prevent the loss of fidelity.
  • We’ve explained how to obtain $p_1, p_2, \ldots$ from honeypots, but how do we obtain $p_0$, the accuracy of newly proposed updates? One way is to use the above to bootstrap those accuracies from voting: we can use this voting technique to infer the accuracy of proposals by looking at what fraction of proposed updates are accepted!
  • Bayesian Smoothing. We assume a relatively low-accuracy prior for the accuracy of individuals. This is a pessimistic assumption that keeps new, untested users from having too much influence. It also rewards users for lending their judgment and casting votes as long as those are more accurate than our pessimistic prior. Of course, we also increase the likelihood of showing new Superusers honeypots to give them a chance to prove themselves.
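
Two of the considerations above, integer points and a pessimistic prior, can be sketched together. The post does not say what form the prior takes, so the Beta-style pseudo-count smoothing below, the prior accuracy of 0.6, the prior weight of 10, and $\alpha = 10$ are all assumptions chosen for illustration.

```python
import math


def smoothed_accuracy(correct: int, total: int,
                      prior_accuracy: float = 0.6,
                      prior_weight: float = 10.0) -> float:
    """Honeypot accuracy shrunk toward a pessimistic prior.

    This is the posterior mean under a Beta prior worth `prior_weight`
    pseudo-votes; both default values are invented for this example.
    """
    return (correct + prior_accuracy * prior_weight) / (total + prior_weight)


def integer_points(p_k: float, alpha: float = 10.0) -> int:
    """s_k = [alpha * logit(p_k)], rounded to the nearest integer."""
    return round(alpha * math.log(p_k / (1.0 - p_k)))


# A brand-new user sits at the prior; a user with 90/100 honeypots correct
# is pulled slightly below their raw 90% accuracy.
for correct, total in [(0, 0), (10, 10), (90, 100)]:
    p_k = smoothed_accuracy(correct, total)
    print(correct, total, round(p_k, 3), integer_points(p_k))
```

If points are scaled by $\alpha$ in this way, the threshold $t$ has to be scaled by the same $\alpha$ for the accept and reject conditions to keep their meaning.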

–Michael Li
Data Scientist

Foursquare’s new notifications and the future of contextual mobile experiences

For the last year I’ve been obsessed with a new breed of mobile applications that are aware of a user’s context: who they are, where they are in the world, and what is going on around them. Apps like Dark Sky, Google Now, and Square Wallet are starting to enable amazing new real-world experiences that make users feel like they have superpowers by connecting them seamlessly to information.

Last month, we launched the new Foursquare notifications, which automatically let people know about the best dishes on the menu when they walk into a restaurant, or the top spots not to miss when they land in a new city. In this talk at Data Driven NYC, I explain how we built this exciting new product from the data exhaust of millions of mobile devices, and how it sets the groundwork for an exciting new world of highly-targeted contextual experiences.

Data Driven NYC 20 // Blake Shaw of Foursquare from Matt Turck on Vimeo.

– Blake (@metablake)

A chat about data science and our fun visualizations

A little while back, I gave a talk on a Big Data Panel at the Stanford Graduate School of Business’s China 2.0 conference.  We had a great discussion about the uses of data science and the fun visualizations we do with our data at Foursquare. Check it out: 


How we built our Model Training Engine

At Foursquare, we have large-scale machine-learning problems. From choosing which venue a user is trying to check in at based on a noisy GPS signal, to serving personalized recommendations, discounts, and promoted updates to users based on where they or their friends have been, almost every aspect of the app uses machine-learning in some way.  All of these queries happen at a massive scale: we average one million Explore queries and six million check-ins every day. Not only do we have to process each request faster than the blink of an eye, but these millions of user interactions are giving us millions of data points to feed back into our models to make them better. We’ve been building out a Model Training Engine (MTE) to automate our (machine) learning from user data.  Here’s an overview to whet your appetite.

Fitting the model to the data rather than the data to the model.

Many models are built using linear regressions or similar approaches. While these models can help us quickly understand data (and we certainly make use of them), they make convenient but unrealistic assumptions and are limited in the kinds of relationships they can express. The MTE uses techniques like Boosted Decision Trees or Random Forests (we have both a scikit-learn and an in-house MapReduce based implementation) to learn much more detailed and nuanced models that fit the data better.

Keeping models fresh and relevant.

With 6 million new check-ins a day, models quickly get stale. The MTE automatically retrains models daily based on the latest signals and the latest data. New signals and changes in old signals are immediately incorporated into new models and we monitor and deploy newer models when they outperform older ones.

Model training that scales with data and the organization.

With a large-scale, very interconnected system, changes made by other engineers on a seemingly unrelated app feature could throw off a very carefully calibrated model. How do we make model building scale across billions of check-ins and an entire organization without engineers stepping on each other’s toes?

To make models scalable across our data, we’ve rolled our own online learning algorithms and use clever downsampling techniques when we cannot load the entire dataset into memory. We use techniques like bagging and cross-validation to learn how to combine different signals into a single prediction in a way that maximizes the contribution from each signal without picking up on spurious correlations (i.e., overfitting). This means that no one can throw off the model by adding or tweaking a signal. For example, if an engineer accidentally adds random noise (e.g. dice rolls) as a signal, the MTE would quickly detect that the signal was not predictive and ignore it. This allows us to be open to new ideas and signals from pretty much anyone at the company, not just data scientists.
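
The dice-roll example can be made concrete with a toy experiment. The post does not describe how the MTE actually scores signals, so this is only a sketch of the general idea: it swaps in a one-signal threshold classifier instead of boosted trees, uses synthetic data, and all names are hypothetical. What it shows is that held-out (cross-validated) accuracy exposes a signal with no predictive value.

```python
import random
from statistics import mean
from typing import List


def cv_accuracy(xs: List[float], ys: List[int], folds: int = 5) -> float:
    """Held-out accuracy of a one-signal threshold classifier, averaged over folds."""
    scores = []
    for f in range(folds):
        rows = list(enumerate(zip(xs, ys)))
        train = [xy for i, xy in rows if i % folds != f]
        test = [xy for i, xy in rows if i % folds == f]
        # Fit on the training folds only: threshold halfway between class means.
        mean1 = mean(x for x, y in train if y == 1)
        mean0 = mean(x for x, y in train if y == 0)
        threshold = (mean0 + mean1) / 2
        high_means_one = mean1 >= mean0
        hits = 0
        for x, y in test:
            predicted_one = (x > threshold) if high_means_one else (x <= threshold)
            hits += predicted_one == (y == 1)
        scores.append(hits / len(test))
    return mean(scores)


rng = random.Random(0)                            # fixed seed for repeatability
ys = [rng.randint(0, 1) for _ in range(2000)]     # synthetic labels
useful = [y + rng.gauss(0, 0.5) for y in ys]      # signal correlated with the label
dice = [float(rng.randint(1, 6)) for _ in ys]     # dice rolls, unrelated to the label

print("useful signal:", cv_accuracy(useful, ys))
print("dice rolls:   ", cv_accuracy(dice, ys))
```

On held-out data the dice signal should score close to a coin flip, which is the cue a training pipeline can use to give it no weight.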

What’s more, the MTE can adapt to frequent UX and other product changes, all without human intervention. For example, if our mobile team changes the UI to make friends’ prior visits more prominent, the MTE will automatically detect that users are weighing social signals more heavily and adjust our models accordingly. And our automated Model Training Engine means that engineers can concentrate on building signals and let the model training select their best ones.

All of these quality improvements are translating into a better and smarter user experience. More details (with code) and quality improvements to come!

–Michael Li, Data Scientist

Foursquare Native Auth on iOS and Android: Developers, connect your users more quickly than ever

A few weeks ago we were excited to announce one of our most-wished-for features from our developer community, native authentication for iOS, and today we’re happy to announce we’ve also shipped support for native auth on Android in our latest release of Foursquare on Google Play! In a nutshell, this means that your users can connect their Foursquare accounts to your app without wrangling with messy WebViews and log-ins. Native authentication simply pops your users into the Foursquare app on their phone and lets them use their existing credentials there.

And even though this has only been out for a few short weeks, we love what our developers have been doing with it so far. If you want to see what native auth looks and feels like in the wild, install the latest version of quick check-in app Checkie: after using Foursquare to find a place for you and your friends to go, Checkie lets you check in with incredible speed.

Since Checkie uses our checkins/add endpoint, users need a way to log in. Below is what the app used to look like upon opening. Users were taken directly to a WebView where they had to type in (and, more importantly, remember, without the aid of Facebook Connect) their Foursquare credentials before continuing to use Checkie.

For this old flow to succeed, at least four taps are necessary, along with who knows how many keystrokes. Below is how the new Checkie flow works after integrating native auth: there’s a more informational screen when the app opens, and only two taps are necessary to begin actually using Checkie: “Sign in,” which bumps users to the Foursquare app where they can hit “Allow.”

How You Can Use Native Auth Today

You too can get started using this flow right away. We have libraries and sample code for iOS and Android available on GitHub that you can dive straight into. The details vary depending on OS, but the overall conceptual process is similar for both and outlined below—it should be familiar for those who have worked with 3-legged OAuth before.

  1. Update your app’s settings. You need to modify your app’s redirect URIs (iOS) or add a key hash (Android).

  2. Include our new libraries in your project. OS-specific instructions are found on their GitHub pages.

  3. Unless you want to use it as a backup mechanism, get rid of that (UI)WebView! Chances are, if you expect your users to have Foursquare accounts, they’ll have the app on their phones.

  4. Call our new native authorize methods. On iOS, it’s authorizeUserUsingClientId; on Android, it’s FoursquareOAuth.getConnectIntent then startActivityForResult with the returned intent. These methods bounce your users to the Foursquare app’s authorize screen or return appropriate fallback responses allowing them to download the app.

  5. If your user authorizes your app, they will land back in your app. Follow OS-specific instructions to obtain an access code. This should involve calling either accessCodeForFSOAuthURL (iOS) or FoursquareOAuth.getAuthCodeFromResult (Android).

  6. Trade this access code for an access token. The access token (not access code) is what is eventually used to make calls on behalf of a particular user. There are two ways to do this:

    1. (Preferred) Pass the access code to your server, and then make a server-side call to Foursquare; see step 3 under our code flow docs for details on the exact parameters needed. The response from Foursquare will be an access token, which can be saved and should be used to make auth’d requests. This method is preferable because it avoids including your client secret in your app. For more details, see our page on connecting.

    2. Call our new native methods to get an access token. On iOS it’s requestAccessTokenForCode. On Android it’s FoursquareOAuth.getTokenExchangeIntent followed by startActivityForResult (make sure you also make the requisite changes to AndroidManifest.xml).
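
For the preferred server-side exchange, a rough sketch of building the request is shown below. The parameter names are the usual OAuth 2.0 authorization-code ones and are an assumption here; the exact parameters, the endpoint URL, and whether they travel as a query string or a form body should be taken from step 3 of the code flow docs. The function name and example values are hypothetical.

```python
from urllib.parse import urlencode


def encode_token_request(client_id: str, client_secret: str,
                         redirect_uri: str, code: str) -> str:
    """URL-encode the parameters that trade an access code for an access token.

    Parameter names follow the common OAuth 2.0 authorization-code grant;
    verify them against the code flow docs before relying on this.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,   # lives on the server, never in the app
        "grant_type": "authorization_code",
        "redirect_uri": redirect_uri,
        "code": code,                     # the access code sent up by the app
    }
    return urlencode(params)


# Placeholder values; the endpoint itself is intentionally not hard-coded.
print(encode_token_request("ID", "SECRET", "myapp://callback", "CODE"))
```

Keeping this call on the server means the mobile app only ever handles the short-lived access code, never the client secret.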

If you have any comments or questions about this new native auth flow—or anything API-related in general!—please reach out to

David Hu, Developer Advocate

Machine learning at Foursquare

In March, I spoke at Queens Open Tech about machine learning at Foursquare. The talk gives a nice overview of the kinds of insights we have about human behavior from check-in data and our machine-learning setup. Learn how we used smarter algorithms to get 20,000 people to try a new place every week.

Michael Li, Data Scientist at Foursquare

Quattroshapes: A Global Polygon Gazetteer from Foursquare

Foursquare’s geographic infrastructure relies on numerous pieces of open geo software (PostGIS, GDAL, Shapely, Fiona, QGIS, S2, and JTS) as well as open geographic data: OSM, US Census’ TIGER, Canada’s geogratis, Mexico’s INEGI, and EuroGeoGraphics, to name a few. We’ve been inspired by existing efforts around geographic data including the alphashapes and betashapes projects. We are eager and excited to contribute back to the open geo ecosystem with a few projects that I demoed recently at foss4g-na and State of the Map US.



Geographic polygon / boundary data is important to us as a way to aggregate venues around places like cities and neighborhoods. Finding a good source of city data around the world has proved difficult. For that reason, we’ve been curating a set of worldwide polygon data that we’re calling Quattroshapes. Quattroshapes debuted at Nathaniel Vaughn Kelso’s talk at State of the Map US this past weekend. The project combines normalizing open government data with synthesizing new polygons out of Flickr photos and Foursquare checkin data in places where open government data is unavailable. It’s called quattroshapes because it’s the fourth iteration (that we know of) of the work Flickr did on alphashapes and SimpleGeo did on betashapes. Also, it’s based on a quadtree.

We use this polygon data in twofishes, our coarse, splitting, forward and reverse geocoder based on the dataset. Twofishes has been open source since we first wrote it, and we have recently started releasing prebuilt indexes, complete with autocomplete and partial worldwide city-level reverse geocoding functionality. Twofishes is used in Foursquare Explore on the web. We’re looking at using it with our mobile applications as well to provide the best experience to our users. We’re also proud to say that our friends at Twitter have found a use for it as well.


We’re eager to collaborate with others on continuing to source and create this data. If you know of open (redistributable, commercial-friendly) datasets that we’ve missed, please let us know. If you have large sources of labeled point data that you think could help create more accurate inferred polygons, we’re interested in that too. If you make use of the quattroshapes or twofishes project, we’d love to hear how you’re using it and how it’s working out for you.

David Blackman, Geo Lead at Foursquare

Load tests for the real world

The gold standard for systems performance measurement is a load test, which is a deterministic process of putting a demand on a system to establish its capacity. For example, you might load test a web search cluster by playing back actual logged user requests at a controlled rate. Load tests make great benchmarks for performance tuning exactly because they are deterministic and repeatable. Unfortunately, they just don’t work for some of us.

At Foursquare, we push new versions of our application code at master/HEAD to production at least daily. We are constantly adding features, tweaking how old features work, doing A/B tests on experimental features, and doing behind-the-scenes work like refactoring and optimization to boot. So any load test we might create would have to be constantly updated to keep up with new features and new code. This hypothetical situation is reminiscent of bad unit tests that basically repeat the code being tested: duplicated effort for dubious gain.

To make things even worse, a lot of our features rely on a lot of data. For example, to surface insights after you check in to a location on Foursquare we have to consider all your previous check-ins, your friends’ check-ins, popular tips at the venue, nearby venues that are popular right now, etc. etc. Creating an environment in which we might run a meaningful load test would require us to duplicate a lot of data, maybe as much as the whole site. A lot of data means a lot of RAM to serve it from, and RAM is expensive.

So we usually choose not to attempt these “canned” load tests. In lieu of a classic load test, our go-to pre-launch performance test is what we call a “dark test.” A dark test involves generating extra work in the system in response to actual requests from users.

For example, in June 2012, we rolled out a major Foursquare redesign in which we switched the main view of the app from a simple list of recent friend check-ins to an activity stream which included other types of content like tips and likes. Behind the scenes, the activity stream implementation was much more complex than the old check-in list. This was in part because we wanted to support advanced behavior like collapsing (your friend just added 50 tips to her to-do list, we should collapse them all into a single stream item).


Before and after the redesign

Perhaps surprisingly, the biggest driver of additional complexity was the requirement for infinite scroll, which meant we needed to be ready to materialize any range of activity for all users. Since the intention was for the activity stream to be the main view a user sees upon opening the Foursquare app, we knew that the activity stream API endpoint would receive many, many requests as soon as users started to download and use the new version of the app. Above all, we did not want to make a big fuss about this great new feature and then give our users a bad experience by serving errors to them when they tried to use it. Dark testing was a key factor in making the launch a success.

The first version of the dark test was very simple: whenever a Foursquare client makes a request for the recent check-ins list, generate an activity stream response in parallel with the recent check-ins response, then throw the activity stream response away. We then hooked this up to a runtime control in our application which permitted it to be invoked on an arbitrary percentage of requests, so we were able to generate this work for one percent, five percent, 20 percent, etc. of all check-in list requests. By the time we were a few weeks out from redesign launch, we were running this test 24/7 for one-hundred percent of requests, which gave us pretty good confidence that we could launch this feature without overloading our systems.
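
The shape of that first dark test can be sketched as below. This is a simplified illustration, not Foursquare’s implementation: the class and function names are hypothetical, the extra work runs sequentially rather than in parallel with the real response, and swallowing dark-path errors is an assumption about what a safe version would do.

```python
import random
from typing import Callable, List


class DarkTest:
    """Runs extra, throwaway work for a configurable share of real requests."""

    def __init__(self, percent: float, rng: Callable[[], float] = random.random):
        self.percent = percent   # runtime control, from 0 to 100
        self.rng = rng           # injectable so tests can be deterministic
        self.dark_runs = 0

    def should_run(self) -> bool:
        return self.rng() * 100 < self.percent


def handle_recent_checkins(user_id: str, dark: DarkTest,
                           load_checkins: Callable[[str], List[str]],
                           build_activity_stream: Callable[[str], List[str]]) -> List[str]:
    """Serve the real response and, for sampled requests, exercise the new path."""
    if dark.should_run():
        try:
            build_activity_stream(user_id)   # result is deliberately thrown away
            dark.dark_runs += 1
        except Exception:
            pass  # assumption: a dark-path failure must not affect the user
    return load_checkins(user_id)


# Ramp-up is just a matter of raising `percent`: 1, 5, 20, and finally 100.
dark = DarkTest(percent=20)
for _ in range(1000):
    handle_recent_checkins("user-1", dark, lambda u: ["checkin"], lambda u: ["item"])
print("dark runs out of 1000 requests:", dark.dark_runs)
```

The user-facing response never depends on the dark path, so the new code sees production traffic patterns and data without any visible effect.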

– Cooper Bethea (@cooperb)