Pants 1.0

Today, the Pants Project announced the release of Pants Build 1.0. Foursquare is a proud contributor to Pants, and we’d like to thank and congratulate our fellow contributors in the Pants community. Foursquare’s developer workflow benefits greatly from Pants, especially because of Pants’ caching of build artifacts, its dependency management, which enforces code hygiene, and its dependency graph, which allows easily compiling and testing all source affected by a change.

Caching

We have many extensions to Pants (generally in the form of Tasks), and the majority of them have caching for free right out of the box. Generally, a Jenkins CI worker runs a variety of jobs (lint, codegen, compile, etc) over each commit pushed upstream. These jobs benefit from the existing cache, so they run faster, and they also populate the cache with new results.

For developers iterating on a change, the majority of build results can be pulled down from the remote cache, so results only need to be recomputed for what the developer has changed. This reduces iteration time significantly and allows plugin developers to write more aggressive automated checks, since any given check will only need to be run incrementally. More aggressive automated checks, in turn, allow developers to focus on the aspects of programming that cannot be automated.

But the star of the caching story is Scala: Pants’ Zinc/scalac wrapper emits artifacts to the remote cache as soon as each target is successfully compiled, and double-checks the cache before committing to compiling a given target. This “eager write, lazy read” strategy, coupled with randomized target ordering in very large compiles, allows several CI jobs running in parallel over similar target sets to avoid duplicating effort. We now have a CI job that compiles the entire Scala codebase for every commit pushed upstream. In practice, it only builds a few dozen targets each time; the vast majority of the build is already cached. This job makes it easier to track down which commit broke the build, and keeps the remote cache fully populated so that developers never have to recompile code that they didn’t change.

Dependency Management

In a very large codebase, it is critical to keep internal code dependencies under control. Without automatic enforcement of dependency rules, it is almost impossible to prevent circular dependencies, hairballs, or bad cross dependencies (e.g. library code depending on test code). Pants keeps our code very granular (one build target per JVM package per directory) and allows us to enforce arbitrary graph rules with easy-to-write tasks. Buildgen ensures that our BUILD file dependencies always represent the true dependencies of the source code in the target. Developers know instantly–in the form of a build break–if they have introduced a dependency that violates one of our rules. Some examples:

  • Library code may not depend on test code.
  • Service A may not depend on the concrete implementation of Service B, only the interface.
  • New services may not depend on known-bad “hairball” models. The models can’t be killed off entirely, but they can at least be contained. (Internally, we’ve taken to describing code as being in a “hairball” when it is an extremely-tangled mess of dependencies.)
  • Common code cannot depend on non-common code.
  • Open source code cannot depend on closed source code.

The configuration for these rules is simple and flexible, and the check itself is almost instantaneous. These automatic checks keep the repo sanitized so that developers don’t have to.

./pants test-changed

Likely the most frequently run Pants command at Foursquare is ./pants test-changed (and its equivalent for compilation, compile-changed), which underlies most of our convenience scripts and workflow guidelines for developers. test-changed takes advantage of the user’s SCM (git in our case) and Pants’ dependency graph to compile and test the targets that the user has changed locally. By default, it also compiles and tests the direct dependees of the changed targets. (If A depends on B, then we say A is a dependee of B). test-changed can optionally run against all transitive dependees, or no dependees at all.

This workflow shows off the power of Pants: with a single command, the user can compile and test exactly what could have been broken by their change–no more and no less. The developer doesn’t need to worry about or git grep for what other code might depend on their change–all of the information necessary to confirm that their change is safe is automatically inferrable. And of course the result is cached locally, so further invocations (without more source changes) will be quick no-ops.

Extensibility

We maintain a loose source plugin in our repo for various custom tasks and targets. “Loose source” means that the Python sources sit directly in the repo and are run straight from code–no need to reinstall anything in order for changes to take effect. You can see some of our open source examples here. It is extremely valuable to be able to quickly prototype and deploy extensions to the build tool. From trivial tasks like linters to an alternative JVM dependency resolver, Pants allows us to easily inject our own logic into the build pipeline.

Community

In addition to the technical benefits of using Pants, we have enjoyed the friendly, active, and responsive Pants community. Developers and users are very active on Slack and the mailing list, so we’ve never found ourselves wanting for help. We are very proud to see Pants reach version 1.0, a significant release in terms of user friendliness and API stability. If your organization or project is feeling the pain of long compile cycles or large dependency graphs–especially if you are a Scala or Python shop–then we encourage you to give Pants a try! If you’d like to see what an advanced, production installation of Pants looks like, check out Foursquare’s open source repo, fsqio.

— Patrick Lawson and Mateo Rodriguez (@mateornaut) plus the rest of the Foursquare #build team

P.S. we’re hiring!

Improving Our Engineering Interview Process

Previous Process and Motivations

Up until a year ago, Foursquare had a very typical interview process for a startup. We started with a phone call where the candidate implemented one or two simple questions in a collaborative editor. If they passed, we would bring them on site for a series of hour-long interviews, with mostly whiteboard coding and some system design discussion.

In terms of ensuring that the engineers we hired were great, this process worked very well: we hired a great engineering team. However, we were concerned that it filtered out many other great engineers for reasons unrelated to their abilities, such as variation between interviewers, candidate nervousness, or the general unnaturalness of whiteboard coding. There was some evidence that this was happening, since our peer companies would sometimes hire people we rejected. In addition, studies have shown there isn’t much correlation between interview performance and job performance.

New Process

Today, we forgo technical phone interviews whenever possible. They’re typically unpleasant for everyone involved, and we felt that the environment of a phone screen wasn’t conducive to comprehensively learning about a candidate’s abilities. Instead, we give out a take-home exercise that takes about three hours. The exercise consists of three questions:

  1. A single-function coding question
  2. A slightly more complicated coding question that involves creating a data structure
  3. A short design doc (less than a page) on how to implement a specific service and its endpoints.

Every question we use is based on a real problem we’ve had to solve and has a preamble explaining the reason we need to solve this problem. If there is an obvious solution with a poor running time we mention it since we can’t help course-correct when the work isn’t being done live. We also provide scaffolding for the coding questions to save the candidate time.

Once we receive a take-home submission it is anonymized and put into a code review tool. The code and design doc are handed off to an engineer who does the grading. Because the submissions are anonymized, the grading is done blind, which reduces potential bias early in the process. For the coding questions we have a rubric that assesses the answers in terms of correctness, approach, and readability. The third question is used as the basis for an on-site discussion about systems design.

If the candidate passes, we invite them into the office for in-person interviews that include a follow-up on all the take-home work they have already provided. For the two coding questions we have the candidate walk through the code they wrote on a laptop and explain how it works. We use any bugs or other interesting parts of their code to spark conversations, and we don’t spend the session nit-picking. For the design question, the interviewer will pick a few parts of the design that the candidate seems to have fleshed out and try to go in depth on them.

We still have some coding during our on-site process after the take-home, depending on the candidate’s work experience. For all interviews we try to keep the atmosphere collaborative and conversational. Beyond coding, we talk about topics like designing a feature for a system, previous projects, previously used technologies (or new and exciting ones), and more.

Some things we learned

The biggest thing we learned is that a take-home exercise based on problems we’ve had to solve is a more accurate gauge than phone screens. When candidates come on site, their overall interview performance is very close to how they did on the take-home, whereas with phone screens it was much more variable.

When the interviewer and candidate have the code and design already written at the start of the interview they are able to dive deeper and have more insightful discussions. It has a more natural feel and flow than live coding interviews because it more closely mirrors what our engineers do in their day-to-day.

In an early version of the process, we tried to have the candidate find and explain their bugs. It was much more difficult for the interviewer to conduct, and it led to less consistent assessments and worse candidate experiences than the “code review” style mentioned above.

Because the interview day now starts with something familiar, it helps reduce nervousness and serves as a great warmup for the rest of the day. As a bonus, the feedback from candidates has been overwhelmingly positive. They especially seem to enjoy how interactive this interview is and that they can use the technical conversation to evaluate Foursquare themselves.

In terms of the content of the take-home, early versions asked the candidate to provide tests for their code. The tests provided basically no extra signal and added significantly to the time it took to complete the exercise, so we eliminated them. We also added scaffolding to save the candidate time, which had the bonus of making it much easier to run our auto-tests.

Next

We know the process still isn’t quite perfect, but we’re happy to have taken this first step. Foursquare’s other departments are starting to adopt this approach in their own interviewing processes after seeing our success! We’re excited to continue working on our process because we’re determined to build a great team and do it in the best way possible.

— Jeff Jenkins (@JeffWJenkins) and David Park (@dotdpark)

P.S. we’re hiring!

Cross Language Information Retrieval Via Taste Translation

What’s the best place for lamb in Santiago? If you’re a local, you’d know to hit up Jewel of India for their cordero magallanico or Barrica 54 to try the Garrón de Cordero. But what if you’re an English-speaking traveler visiting the city, and you don’t know a cordero from a cortado?

Whether you are looking for inspiration on where to shop or searching out the best iced coffee in town, Foursquare is able to help you decide where to go and what to do. Our rich venue data is created by tourists and locals alike and is used to recommend hidden local gems as well as popular tourist attractions.

The Foursquare app and website are translated into 11 different languages, allowing users from around the world to write tips about their favorite places in their native language. More than half of all of our existing tips are in a language other than English. This plethora of data from locals around the world poses a specific problem though: How do we make use of it to help users who do not speak the language of the created content? How can we build the best experience for an English user visiting Tokyo, or a Turkish user visiting Sao Paulo?

Traditional search engines use a variety of information retrieval techniques to find the documents or web pages that are most relevant to a given query, typically in the same language as the query. While this model works well for generic search engines, it would leave Foursquare users travelling to destinations without much data in their native language with suboptimal results when searching for specific foods, or with results based only on tips left by tourists rather than locals.


English lamb search in Santiago without translations

Korean gelato search in NYC without translations


The paucity of non-native language content within a specific geographic region, however, does not mean that we are completely blind as to the quality and content of the venues within that region. Native non-English speakers in other countries are using Foursquare the same way you are using it at home. Up to this point, though, our users’ experience when searching for tastes has differed between languages. A Japanese user visiting the US who searched for tastes such as gelato (ジェラート) within the Japanese Foursquare app on their phone would only get back results where other Japanese users had written about ジェラート, even though there is a tremendous amount of English gelato data that would be useful to that user.

To address this problem, we are happy to announce that we have started the process of improving the underlying taste models for each individual language to include the appropriate language translations. Under the hood, you can visualize our taste model as a large ontology of terms represented as directed acyclic graphs. Previously, each language’s taste model was distinct from every other, with no links between them. Adding translation links to these ontologies allows us to make use of the language-specific curations (e.g. 納豆/natto implying 朝食系/breakfast food in the Japanese taste model) that were made in the taste models of each language, which in turn provides an even better, more localized experience when using Foursquare abroad. No longer will your results be limited by the extent to which other tourists who speak your language have left content in the area you are visiting. Instead, you will truly be able to live like a local by leveraging all of the foreign language data that was previously indecipherable to your queries.

More concretely, there are two very specific changes that users will have access to. The first is under-the-hood query expansion for taste matches into the languages for which we have translations. For example, a query for “lamb” in Santiago, Chile, will now be automatically expanded to incorporate Spanish results for “cordero” by traversing the multilingual taste graph.
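
As a rough illustration of that traversal (the class and field names below are hypothetical, not our production code), a taste node can carry translation links to its counterparts in other languages, and a matched query term is expanded by following those links:

    import java.util.*;

    // Hypothetical sketch of expanding a matched taste across translation links.
    class TasteNode {
        final String term;      // e.g. "lamb"
        final String language;  // e.g. "en"
        final List<TasteNode> translations = new ArrayList<>();  // e.g. "cordero" (es)

        TasteNode(String term, String language) {
            this.term = term;
            this.language = language;
        }
    }

    class QueryExpander {
        // Expand a matched taste into its translated equivalents so that tips
        // written in other languages can also satisfy the query.
        static Set<String> expand(TasteNode matchedTaste) {
            Set<String> terms = new LinkedHashSet<>();
            terms.add(matchedTaste.term);
            for (TasteNode translation : matchedTaste.translations) {
                terms.add(translation.term);
            }
            return terms;
        }

        public static void main(String[] args) {
            TasteNode lamb = new TasteNode("lamb", "en");
            lamb.translations.add(new TasteNode("cordero", "es"));
            System.out.println(expand(lamb));  // [lamb, cordero]
        }
    }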


English lamb search in Santiago with translations

Japanese gelato search in NYC with translations


This query expansion results in huge cross-language information retrieval gains that continue to make Foursquare search the best in the world.

Beyond retrieval wins, these translation links also give Foursquare the ability to highlight the top tastes at a venue on the venue page, even if there is no content in your language to support that taste.


English tip snippets in Japanese

NYC venue as seen by a Japanese user


Taste pile in Japanese

Taste filtering in Japanese


We currently have translations enabled between English, Spanish, Japanese, and Indonesian, but are working to get all of our supported languages enabled. We’re incredibly excited about the improved experience and fidelity of results that this new feature brings to the table.

Interested in helping to continue pushing the envelope and improving Foursquare? Take a look at our job openings here.

Kris Concepcion (@kjc9), Ben Mackey, Matt Kamen (@losfumato), Daniel Salinas (@zzorba42)

Finding Similar Venues in Foursquare

Foursquare has a deep collection of more than 65 million venues. One of the signals we use to help users discover new places they’ll love is similar venues. Similar venues not only powers the features shown below, but is also an underlying scoring component for our search and recommendation algorithms.

While we’ve had a similar venues signal for quite a while, we’ve recently revisited the problem from the ground up. This post will provide a high level overview of how we approached the latest update to venue similarity.

The original venue similarity job was composed of two Hadoop jobs using our Scala wrapper around raw MapReduce, with separate Luigi configurations. After the rewrite, the whole similar venue calculation was contained in one Scalding job with a single Luigi configuration, increasing code legibility. We also optimized the resulting HFile to be smaller than the previous one by reorganizing the Thrift structure. Thanks to this, fetching similar venues online for user-facing services puts less load on the servers.

Covisitation

The premise of using covisitation to calculate venue similarity is that if a lot of people tend to frequent two venues, they may be similar to each other. For example, in San Francisco, it turns out that many users that visit Smitten Ice Cream in Hayes Valley also visit Humphry Slocombe in the Mission, both similarly hip ice cream shops with seasonal flavors. Covisitation can be a very valuable signal, especially when we have over 8 billion check-ins to work with.

Computing covisitation at scale has its challenges. At a high level, what we want to do is compute the cosine similarity between venues (see the sketch after this list). We model each venue as a vector of users: if we have n users, each venue is a vector of length n. The vector’s value at index i is 1 if the user at index i has been to that venue and 0 otherwise. We can naively compute cosine similarity by mapping over each user’s check-in data and emitting all pairs of visited venues, but the number of pairs is quadratic in the length of a user’s check-in history, which leads to unacceptable runtimes and sadness on our Hadoop cluster for users with deep histories. Here are a few important considerations for optimizing the calculation:

  • Sampling: The folks at Twitter came up with the DISCO algorithm outlined here, which describes a method for intelligently sampling data without greatly altering the final similarity measures.
  • Symmetry: Covisitation is symmetric, so computing it for venue A -> venue B means you don’t need to compute it the other way around. If we order the data properly, we can greatly reduce the number of computations needed.
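
For intuition, here is a minimal, single-machine sketch of the underlying similarity (names are illustrative; the production version is a sampled, distributed job). With 0/1 visit vectors, the dot product reduces to counting shared visitors:

    import java.util.*;

    // Sketch: cosine similarity between venues represented as sets of visitors.
    // For binary visit vectors, cos(A, B) = |A ∩ B| / sqrt(|A| * |B|),
    // where A and B are the sets of users who visited each venue.
    class Covisitation {
        static double cosine(Set<String> visitorsA, Set<String> visitorsB) {
            if (visitorsA.isEmpty() || visitorsB.isEmpty()) return 0.0;
            Set<String> smaller = visitorsA.size() <= visitorsB.size() ? visitorsA : visitorsB;
            Set<String> larger = (smaller == visitorsA) ? visitorsB : visitorsA;
            long shared = smaller.stream().filter(larger::contains).count();
            return shared / Math.sqrt((double) visitorsA.size() * visitorsB.size());
        }

        public static void main(String[] args) {
            Set<String> smitten = new HashSet<>(Arrays.asList("u1", "u2", "u3", "u4"));
            Set<String> humphry = new HashSet<>(Arrays.asList("u2", "u3", "u5"));
            System.out.printf("similarity = %.3f%n", cosine(smitten, humphry));  // 0.577
        }
    }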

Covisitation by itself is not without its flaws. One problem is super-hub venues like airports and train stations that people pretty much have to visit; they can introduce noise despite efforts to normalize them out. There are also problems of locality: a corner deli could be next door to a diner and share many common visitors, but not really be very similar.

Category

When you think of a restaurant you commonly go to, you might assign it just one category. For example, my favorite BBQ restaurant can simply be categorized as a “BBQ restaurant”. However, a venue can have multiple categories: a BBQ restaurant can also be considered a generic “restaurant”, or an American restaurant. In Foursquare, each venue has a primary category but can have other sub-categories as well. Furthermore, categories are modeled in a tree hierarchy, so each category can have a parent and child categories. Previously, we only computed similarity between categories and their parents. A Dim Sum restaurant will have an exact category match with another Dim Sum restaurant, but it will have a mismatch in the deepest category node with a Hong Kong restaurant.

While this works as a naive first pass, it doesn’t prove to be the best way to determine category similarity. For example, a Hot Pot restaurant and a Chinese restaurant are fairly similar, but because of how they are modeled it was difficult to reason about the exact similarity between them. We want a Hot Pot restaurant to be more similar to a Chinese restaurant than to a Filipino restaurant.

To alleviate this issue we used maximum likelihood estimates to answer the following question: Given a venue labeled with category x and at least one other category, what is the probability that it’s also labeled with category y? Looking at co-occurrences between pairs of categories is much more fine-grained compared to looking at exact matches between category lists. Using this metric gives us the desired outcome of a Hot Pot restaurant being more similar to a Chinese restaurant (calculated similarity of 27%) compared to a Filipino restaurant (calculated similarity of 0.0%!). At this point, we’ve already made a big improvement.
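
A small sketch of that estimate (illustrative names only; the real job runs over the full venue corpus): for every venue carrying at least two categories, count pairwise co-occurrences, then divide by the number of such venues carrying category x to get P(y | x):

    import java.util.*;

    // Sketch: maximum likelihood estimate of
    // P(venue also has category y | venue has category x and at least one other category).
    class CategoryCooccurrence {
        // countX[x]    = number of multi-category venues labeled with x
        // countXY[x,y] = number of venues labeled with both x and y
        final Map<String, Integer> countX = new HashMap<>();
        final Map<String, Map<String, Integer>> countXY = new HashMap<>();

        void addVenue(Set<String> categories) {
            if (categories.size() < 2) return;  // needs at least one other category
            for (String x : categories) {
                countX.merge(x, 1, Integer::sum);
                for (String y : categories) {
                    if (x.equals(y)) continue;
                    countXY.computeIfAbsent(x, k -> new HashMap<>()).merge(y, 1, Integer::sum);
                }
            }
        }

        double similarity(String x, String y) {
            int denominator = countX.getOrDefault(x, 0);
            if (denominator == 0) return 0.0;
            int numerator = countXY.getOrDefault(x, Collections.emptyMap()).getOrDefault(y, 0);
            return (double) numerator / denominator;
        }
    }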

Looking at covisitation between venues with a high category similarity is exactly the situation we’ve described above: if a user frequents Ice Cream shop X and Ice Cream shop Y, X and Y are very likely to be similar. This metric works well for major cities with lots of foot traffic and rich venue information. However, what if we are looking at a city or a neighborhood that is less populated?

Tastes

Tastes are another valuable signal for determining venue similarity, and one that is only available on Foursquare. Each venue on Foursquare has tastes associated with it, so if we have venues without a lot of check-ins we can rely on tastes for coverage. At a high level, taste similarity is calculated by looking at the tf-idf weights of the tastes two venues share (to learn more about another fun way we use tf-idf, check this out: Geographic Taste Uniqueness). With this metric we can group venues with similar primary tastes even if we don’t have rich covisitation data. Taste similarity can also help find venues that seem unrelated on the surface but provide similar food and experiences. For example, if you’re a fan of poutine you will love both Citizen’s Band, an American Restaurant, and Jamber Wine Pub, a Wine Bar and Gastropub.
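
A minimal sketch of the idea (the data layout here is hypothetical, not our production scorer): score only the tastes two venues share, weighting each by its tf-idf so that rare, distinctive tastes like “poutine” count for more than ubiquitous ones like “coffee”:

    import java.util.*;

    // Sketch: taste similarity as a dot product of tf-idf weights over shared tastes.
    class TasteSimilarity {
        // tf maps taste -> how strongly it is associated with the venue (e.g. mention count);
        // idf maps taste -> log(totalVenues / venuesWithTaste), down-weighting common tastes.
        static double similarity(Map<String, Double> tfA,
                                 Map<String, Double> tfB,
                                 Map<String, Double> idf) {
            double score = 0.0;
            for (Map.Entry<String, Double> entry : tfA.entrySet()) {
                Double otherTf = tfB.get(entry.getKey());
                if (otherTf == null) continue;  // only shared tastes contribute
                double weight = idf.getOrDefault(entry.getKey(), 0.0);
                score += (entry.getValue() * weight) * (otherTf * weight);
            }
            return score;
        }
    }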

All these parts work together to produce a list of venues similar to your favorite venue. As a result of this effort, we’ve greatly increased the number of similar venues served, and the new method has resulted in a 7% increase in CTR for similar venues on the web in the United States and a 3% increase globally. You can check out similar venues on the web or in the app when you’re looking at a venue page. If you’re interested in working with big data to make meaningful user experiences like these, come join us!

— Billy Seol (@pabpbapapb)

Improving Language Detection

At Foursquare, we attempt to personalize as much of the product as we can. In order to understand the more than 70 million tips and 1.3 billion shouts our users have left at venues, each of those pieces of text must be run through our natural language processing pipeline. The very foundation of this pipeline is our ability to identify the language of a given piece of text.

Traditional language detectors are typically implemented using a character trigram model or a dictionary-based n-gram model. The accuracy of these approaches is directly proportional to the length of the text being classified. For short pieces of text like tips in Foursquare or shouts in Swarm (see examples below), however, the efficacy of these solutions begins to break down. For example, if a user writes only a single word like “taco!” or an equally ambiguous statement like “mmmm strudel,” a generic character- or word-based solution would not be able to make a strong language classification on those short strings. Unfortunately, given the nature of the Foursquare products, these sorts of short strings are very commonplace, and we needed a better way to accurately classify the languages in which they are written.



To this end, we decided to rethink the generic language identification algorithms and build our own identification system, making use of some of the more unique aspects of Foursquare data: the location where the text was created and the ability to aggregate all texts by their writer. While there are many multilingual users on our platform, the average Foursquare user only ever writes tips or shouts in a single language. Given that fact, it seemed inefficient to apply a generic language classification model against all of the text that a single user creates. If we have 49 data points that strongly point to a user writing in English, and that user’s 50th data point is an ambiguous text that a generic language model thinks could be German or English (with 40% and 38% confidence, respectively), chances are that the string should correctly be tagged as English and not German, even if the text contains German loanwords. Our solution to this problem was to build a custom language model for every one of our users who leaves tips or shouts, and then to allow those user language models to help influence the result of the generic language detection algorithm.

The first step in this process is to run generic language detection on every tip and shout in the database. Each tip and shout is associated with a venue, and each venue has an explicit lat/long. We then reverse geocode that lat/long to the country in which the venue is located, which tells us the country the user was in when they wrote the text. Next, we couple the generic language detection results with this country data to create a language model for every country. While this per-country language distribution model may not correctly resemble the real-life language distributions of a given country, it does model the language behavior of the users who share text via Foursquare and Swarm in those countries.

Example of top 5 languages and weights calculated in the country language models:

With country models in hand, we then do a separate grouping of strings by user and are able to calculate a language distribution on a per-user basis. However, one of the problems with this approach is that not every user has enough data to create a reliable user model. A new user who is multilingual will cause classification problems with this system early on due to the lack of data to produce a reliable model. To solve this particular problem we use the language model of the dominant country for that user as a baseline. When a user has little to no data for their user language model, we allow the country model to be merged into the low-information user model. As more data becomes available for a given user, we gradually weight the user model more heavily relative to the dominant country model until, once there is enough data, the user model becomes the dominant one of the two.
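
A sketch of that kind of back-off (the smoothing constant and names below are illustrative, not production values): interpolate the user’s observed language distribution with the dominant country’s distribution, letting the user model take over as the user accumulates more text:

    import java.util.*;

    // Sketch: back off from a sparse user language model to the country model.
    class LanguageModelMerger {
        // Illustrative smoothing constant: the country model counts as if it were
        // this many observed strings from the user.
        static final double PRIOR_STRENGTH = 20.0;

        static Map<String, Double> merge(Map<String, Double> userDist,   // from the user's own text
                                         long userStringCount,           // how much text the user has
                                         Map<String, Double> countryDist) {
            double userWeight = userStringCount / (userStringCount + PRIOR_STRENGTH);
            Set<String> languages = new HashSet<>(userDist.keySet());
            languages.addAll(countryDist.keySet());
            Map<String, Double> merged = new HashMap<>();
            for (String language : languages) {
                double p = userWeight * userDist.getOrDefault(language, 0.0)
                         + (1 - userWeight) * countryDist.getOrDefault(language, 0.0);
                merged.put(language, p);
            }
            return merged;
        }
    }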

Finally, we create per-country orthographic feature models using the strings grouped by country. We have a set of 13 orthographic features; when a string triggers one of them, its generic language identification results are aggregated with the results of all other strings in that country that triggered the same feature. This allows us to have a feature “containsHanScript” and have a completely different language distribution in China than the one that is calculated for Japan, where both Chinese and Japanese contain characters from the Han script. Other examples of this are Arabic vs. Farsi with the “containsArabicScript” feature, Russian vs. Ukrainian vs. Bulgarian with the “containsCyrillicScript” feature, and all Romance languages with the “containsLatinScript” feature.
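
As an illustration (the feature detection and keying below are hypothetical simplifications of our 13 features): a simple script check decides which feature a string triggers, and the (country, feature) pair selects the aggregated language distribution to apply:

    import java.util.*;

    // Sketch: per-country language distributions keyed by a triggered orthographic
    // feature, e.g. ("JP", "containsHanScript") vs ("CN", "containsHanScript").
    class OrthographicFeatures {
        // Simplified: return the first script-based feature the text triggers.
        static Optional<String> triggeredFeature(String text) {
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                Character.UnicodeScript script = Character.UnicodeScript.of(cp);
                if (script == Character.UnicodeScript.HAN) return Optional.of("containsHanScript");
                if (script == Character.UnicodeScript.ARABIC) return Optional.of("containsArabicScript");
                if (script == Character.UnicodeScript.CYRILLIC) return Optional.of("containsCyrillicScript");
                if (script == Character.UnicodeScript.LATIN) return Optional.of("containsLatinScript");
                i += Character.charCount(cp);
            }
            return Optional.empty();
        }

        // featureModels maps "countryCode/featureName" -> language distribution.
        static Map<String, Double> lookup(Map<String, Map<String, Double>> featureModels,
                                          String countryCode, String text) {
            Optional<String> feature = triggeredFeature(text);
            if (!feature.isPresent()) return Collections.emptyMap();
            Map<String, Double> model = featureModels.get(countryCode + "/" + feature.get());
            return model != null ? model : Collections.<String, Double>emptyMap();
        }
    }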

With the user models and the orthographic feature models in place, we then rerun language identification on all of our tips and shouts, using the appropriate user’s language model and applying any orthographic feature model that the string triggers. We merge those two results with the generic language detector’s results for the string, and we’re left with a higher-quality language classification. In a preliminary analysis, we were able to correctly tag an additional ~3M tips and ~250M shouts using this method.

Examples of corrected language identification:

If these kinds of language problems interest you, why not check out our current openings!

— Maryam Aly (@maryamaaly), Kris Concepcion (@kjc9), Max Sklar (@maxsklar)

Personal recommendations for the Foursquare homescreen

Earlier this summer, we shipped an update to Foursquare on Android and iOS focused on giving each user a selection of “top picks” as soon as they open the app. Our goals with this new recommendation system were to improve the level of personalization and deliver fresh suggestions every day. Under the hood, this meant a rethinking of our previous recommendations flow.

Previous work

Previous iterations of our recommendation flow relied on a fairly traditional search pipeline that ran exclusively online in our search and recommendations service.

  • O(100s) of candidate venues are retrieved from a non-personalized store, such as our venues Elasticsearch index

  • Personalized data such as prior visit history, similar-venue visits, and friend/follower history is retrieved and used to rank these candidate venues.

  • For the top-ranked venues we choose to show the user, short justification snippets are then generated to demonstrate why each venue matches the user’s search

This works well for intentful searches such as “pizza”, where a user is looking for something specific, but it is limiting for broader query-less recommendations. For broad recommendations, the depth of personalization becomes limited by the size of the initial set of non-personalized candidate venues in the retrieval phase. Simply increasing the size of this set of candidates online would be computationally expensive and would push request latencies past acceptable limits, so we looked toward better utilizing offline computation.

Personalized retrieval

To establish a larger pool of personalized candidate venues, we have created an offline recommendation pipeline. For each user, the technology we call Pilgrim gives us a pretty detailed understanding of the neighborhoods and locations they frequent. Given these locations for a user, we generate a ranked, personalized list of recommendations via a set of Scalding jobs on our Hadoop cluster. These jobs are run at a regular interval by our Luigi workflow management, and the results are served online by HFileService, our immutable key-value data service, which uses the HFile file format from HBase.

Offline recommendations flow

The personalized sources of candidate venues come from a set of offline “fetchers” also implemented in Scalding:

  • Places that friends and people you follow have been to, left tips at, liked, or saved

  • Venues similar to those you’ve liked in the past, both near and far

  • Places that match your explicit tastes

For our more active users, there can be thousands of candidate venues produced by these fetchers, an order of magnitude more than our online approach. We can afford to consider such a large set since we’re processing them offline, out of the critical request path back to the user.

Non-personalized retrieval

Several non-personalized sources of candidate venues are also used.

  • The highest-rated venues for various common intents (dinner, nightlife, etc.).
  • Venues that are newly opened and trending in recent weeks

  • Venues that are popular with out-of-town visitors (if the user is traveling)

  • Venues that are vetted by expert sources like Eater, the Michelin Guide, etc.

The non-personalized sources not only provide a robust set of candidates for new users whom we don’t know much about yet, but also provide novel and high quality venues for existing users. While personalization should skew a user’s recommendations towards those they’ll find relevant and intriguing, we want to avoid creating a “personalization bubble” that misses great places just because the user doesn’t have any personal relation to them.

Logging

For each homepage request, the recommendation server logs to HDFS, via Kafka, which venues have been shown. These server-side logs are combined with client-side reporting of scroll depth, giving us a combined impression log of which venues we have previously shown each user, so we can avoid showing repeated recommendations. This impression information is used for both ranking and diversification.

Ranking

Each candidate venue is individually scored with a combination of signals, seeking to balance factors such as venue novelty, distance, personalized taste match, similarity to favorites, and friend/follower activity. The top ~300 candidates are then kept in an HFile and served on a nightly basis.

Diversification

Some product requirements are difficult to fulfill solely by scoring each venue independently. For instance, it is undesirable to show too many venues of the same category or to show only newly opened restaurants. To introduce diversity, before selecting the final set of venues to show, we try to enforce a set of constraints while also maintaining the ranked order of the candidate venues.
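
A minimal sketch of that kind of constrained selection (the per-category cap here is just an illustrative constraint): walk the candidates in ranked order and skip any venue that would violate a constraint:

    import java.util.*;

    // Sketch: greedy diversification that preserves ranked order while enforcing
    // a cap on how many venues of the same category may be shown.
    class Diversifier {
        static List<String> select(List<String> rankedVenueIds,
                                   Map<String, String> primaryCategoryByVenue,
                                   int maxPerCategory,
                                   int resultSize) {
            List<String> selected = new ArrayList<>();
            Map<String, Integer> usedPerCategory = new HashMap<>();
            for (String venueId : rankedVenueIds) {
                if (selected.size() == resultSize) break;
                String category = primaryCategoryByVenue.getOrDefault(venueId, "unknown");
                int used = usedPerCategory.getOrDefault(category, 0);
                if (used >= maxPerCategory) continue;  // constraint would be violated: skip
                usedPerCategory.put(category, used + 1);
                selected.add(venueId);
            }
            return selected;
        }
    }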

Justifications

Selecting a list of venues to show isn’t the end of the process. Every venue that makes it onto a user’s home screen comes with a brief explanation of what’s interesting (we believe) to them about this venue. These “justifications” are the connective tissue between the sophisticated data processing pipeline and the user’s experience. Each explanation provides not only a touch of personality but also a glimpse into the wealth of data that powers these recommendations.

To accomplish this, the “justifications service” (as we call it internally) is responsible for assembling all the information we know about a venue and a user, combining it, ranking it, and generating a human readable explanation of the single most meaningful and personalized reason the user may be interested in this place.

Broadly speaking, the process can be divided into four stages: Data fetching -> Module execution -> Mixing/Ranking -> Diversification/Selection. Each type of justification that the system can produce is represented by an independent “module”. The module interface represents a simple IO contract, describing a set of input data and returning one or more justifications, each with a generated string and a score. Each module is designed to run independently, so after all the data is fetched, the set of eligible modules runs in parallel. When each module has had an opportunity to produce a justification, the candidates are merged and sorted. A final pass selects a single justification per venue, ensuring not only that the most relevant justifications are chosen but also that there is a certain amount of diversity in the insights provided. All of this happens at runtime, on each request.
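
A rough sketch of that module contract and the merge step (the interface and class names here are illustrative, not the actual service code; the diversity pass is omitted):

    import java.util.*;
    import java.util.stream.*;

    // Sketch: independent modules each propose scored justifications, which are
    // merged and reduced to the single best justification per venue.
    class Justification {
        final String venueId;
        final String text;   // human-readable explanation
        final double score;  // how relevant this justification is for this user

        Justification(String venueId, String text, double score) {
            this.venueId = venueId;
            this.text = text;
            this.score = score;
        }
    }

    interface JustificationModule {
        // Each module consumes pre-fetched data (elided here) and may return
        // zero or more candidate justifications.
        List<Justification> produce(String userId, List<String> venueIds);
    }

    class JustificationMixer {
        static Map<String, Justification> bestPerVenue(String userId,
                                                       List<String> venueIds,
                                                       List<JustificationModule> modules) {
            return modules.parallelStream()                        // modules run independently
                    .flatMap(m -> m.produce(userId, venueIds).stream())
                    .collect(Collectors.toMap(
                            j -> j.venueId,
                            j -> j,
                            (a, b) -> a.score >= b.score ? a : b)); // keep the higher-scoring one
        }
    }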

Here are just a few examples of how the finished product appears in the app:

Local institution, similar place, and scenic view justifications

Upcoming work

With the product in the hands of users, we’re working on learning from user clicks to improve the quality of our recommendations. We’re also running live traffic experiments to test different improvements to our scorers, diversifiers and justifications. Finally, we’re improving the online layer so recommendations can quickly update in response to activity in the app such as liking or saving a venue to a to-do list. If you’re interested in working on search, recommendations or personalization in San Francisco or New York, check out foursquare.com/jobs!

Ben Lee and Matt Kamen

How the World Looks to Your Phone

[Cross-posted from a Quora answer I wrote here.]

One of Foursquare’s most useful features is its ability to send you just the right notification at just the right time, whether it’s visiting a new restaurant, arriving in a city, or hanging out with old friends at the neighborhood bar:

We take a lot of pride in our location technology (also known as Pilgrim) being the best in the industry, enabling us to send these relevant, high-precision, contextual notifications.

Pilgrim is actually not just a single technology, but a set of them, including:

  • The Foursquare database (7 billion check-ins at 80 million places around the world)
  • Stop detection (has the person stopped at a place, or is the person just stopped at a traffic light?)
  • “Snap-to-place” (given a lat/long, wifi, and other sensor readings, at which place is the person located?)
  • Client-side power management (do this all without draining your battery!)
  • Content selection (given that someone has stopped at an interesting place, what should we send you?)
  • Familiarity (has the person been here before? have they been in the neighborhood? or is it their first time?)
  • (and much more…)

We could write a whole post about each of these, but perhaps the most interesting technology is “snap-to-place.” It’s a great example of how our unique data set and passionate users allow us to do things no one else can do.

We have these amazing little computers that we carry around in our pockets, but they don’t see the world in the same way that you and I do. Instead of eyes and ears, they have GPS, a clock, wifi, bluetooth, and other sensors. Our problem, then, is to take readings from those sensors and figure out which of those 80 million places in the world that phone is located.

Most companies start with a database of places that looks like this:

(That’s Washington Square Park in the middle, with several NYU buildings and great coffee shops nearby.)

For every place, they have a latitude and longitude. This is great if your business is giving driving directions or making maps. But what if you want to use these pins to figure out where a phone is?

The naive thing to do is to just drop those pins on a map, draw circles around them, and say the person is “at” a place if they are standing inside the circle. Some implementations also resize the circles based on how big the place is:

This works fine for big places like parks or Walmarts. But in dense areas like cities, airports, and malls (not to mention multi-story buildings, where places are stacked on top of each other), it breaks down. All these circles overlap and there’s no good way to tell places apart.
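
A sketch of that naive approach (radius values and names are made up): treat each venue as a circle and test whether the phone’s reported location falls inside it, which is exactly the check that becomes ambiguous once circles overlap:

    // Sketch: the naive "pin plus radius" check for whether a phone is at a venue.
    class NaiveSnapToPlace {
        static final double EARTH_RADIUS_METERS = 6_371_000;

        // Haversine great-circle distance between two lat/lng points, in meters.
        static double distanceMeters(double lat1, double lng1, double lat2, double lng2) {
            double dLat = Math.toRadians(lat2 - lat1);
            double dLng = Math.toRadians(lng2 - lng1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                       * Math.sin(dLng / 2) * Math.sin(dLng / 2);
            return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
        }

        // "At" the venue if the reading falls inside the venue's circle. In a dense
        // city block, many circles contain the same point, so the answer is ambiguous.
        static boolean isAt(double phoneLat, double phoneLng,
                            double venueLat, double venueLng, double venueRadiusMeters) {
            return distanceMeters(phoneLat, phoneLng, venueLat, venueLng) <= venueRadiusMeters;
        }
    }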

So if that’s not working, you might spend a bunch of time and money looking at satellite photos and drawing the outline of all the places on the map:

This is incredibly time consuming, but it’s possible. Unfortunately, our phones don’t see the world the way a satellite does. GPS bounces off of buildings, gives funny readings and bad accuracies. Different mobile operating systems have different wifi and cell tower maps and translate those in different ways into latitude and longitude. And in multi-story buildings, these polygons sometimes encapsulate dozens of places stacked vertically. The world simply doesn’t look like nice neat rectangles to a phone.

So what does Foursquare do? Well, our users have crawled the world for us and have told us more than 7 billion times where they’re standing and what that place is called. Each time they do, we attach a little bit more data to our models about how those places look to our phones out in the real world. To our phones, the world looks like this:

This is just a projection into a flat image of a model with hundreds of dimensions, but it gives an idea of what the world actually looks like to our phones. We use this and many other signals (like nearby wifi, personalization, social factors, and real-time check-ins) to help power the push notifications you see when you’re exploring the city. Glad you’re enjoying them!

Interested in the machine learning, search, and infrastructure problems that come with working with such massive datasets on a daily basis? Come join us!

Andrew Hogue, Blake Shaw, Berk Kapicioglu, and Stephanie Yang

Managing Table and Collection Views on iOS

As most iOS developers can tell you, dealing with UITableView or UICollectionView can be a pain. These UIKit views form the basis of most iOS apps, but the numerous callback methods that you have to coordinate can be a hassle to manage. There is a lot of boilerplate necessary to get even the simplest of views up and running, and it is easy to create a mismatch between delegate methods that crashes the app.

At Foursquare we’ve solved this issue the way we usually do: we built an intermediate layer with a nicer interface that talks to Apple’s interface for you. We call it FSQCellManifest, and today we are open sourcing it for use by the wider iOS developer community.

In short, FSQCellManifest acts as the datasource and delegate for your table or collection view, handling all the necessary method calls and delegate callbacks for you, while providing a much easier-to-use interface with less boilerplate. It has a ton of built-in features and is flexible enough that we use it on every screen in both of our apps. It saves us a ton of engineering time and effort, and hopefully it will do the same for you.

You can find more documentation and an example app on the project’s GitHub page. And check out Foursquare’s open source portal to find all the other projects we’ve released.

Brian Dorfman

Gson Gotchas on Android

This is Part 2 in our two-part series on latency. In Part 1, we discussed some techniques we use for measuring latency at Foursquare. Here we’ll discuss some specific Gson-related changes we made to improve performance in our Android apps.

Shortly after the launch of Foursquare for Android 8.0, we found that our app was not as performant as we wanted it to be. Scrolling through our homepage was not buttery smooth, but quite janky with frequent GC_FOR_ALLOC calls in logcat. Switching between activities and fragments was not as quick as it should be. We investigated and profiled, and one of the largest items that jumped out was the amount of time spent parsing JSON. In many situations, this turned out to be multiple seconds even on relatively modern hardware such as the Nexus 4, which is crazy. We decided to dig in and do an audit of our JSON parsing code to find out why.

In Foursquare and Swarm for Android, practically all interaction with the server is done through a JSON API. We use Google’s Gson library extensively to deserialize JSON strings into Java objects that we as Android developers like working with. Here’s a simple example that converts a string representation of a venue into a Java object:
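
Roughly, with a stand-in Venue model class:

    import com.google.gson.Gson;

    // Venue is a stand-in for one of our model classes; its fields mirror the JSON keys.
    class Venue {
        String id;
        String name;
    }

    class WholeStringParsing {
        static Venue parseVenue(String json) {
            Gson gson = new Gson();
            // fromJson(String, Class) needs the complete JSON string in memory up front.
            return gson.fromJson(json, Venue.class);
        }
    }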

This works, but we don’t actually need the whole JSON string to begin parsing. Fortunately, Gson has a streaming API that lets us parse a JSON stream one token at a time. Here’s what a simple example of that would look like:
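
Roughly (again with the stand-in Venue class from above):

    import com.google.gson.Gson;
    import com.google.gson.stream.JsonReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    class StreamingParsing {
        static Venue parseVenue(InputStream in) {
            Gson gson = new Gson();
            // JsonReader pulls tokens off the stream as they arrive instead of
            // requiring the whole response string to be in memory first.
            JsonReader reader = new JsonReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            return gson.fromJson(reader, Venue.class);
        }
    }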

So we did this, but still didn’t see any significant speed up or smoother app performance. What was going on? It turns out that we were shooting ourselves in the foot with our usage of custom Gson deserializers. We use custom deserializers because there are times when we don’t want a strict 1:1 mapping between the JSON and the Java objects it deserializes to. Gson allows for this, and provides the JsonDeserializer interface to facilitate this:
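
Its definition, abridged from the Gson library:

    // Abridged from com.google.gson:
    public interface JsonDeserializer<T> {
        T deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context)
                throws JsonParseException;
    }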

To use it, you implement the interface for the type you want it to watch out for. You then register that implementation with the Gson instance you are using to deserialize, and from then on, whenever you try to deserialize some JSON to a given Type typeOfT, Gson will check whether a custom deserializer is registered for that type and, if so, call that deserializer’s deserialize method. We use this for a few types, one of which happens to be our outermost Response type that encapsulates all Foursquare API responses:
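
Simplified, it looked something like this (ApiResponse and its fields are stand-ins for the real wrapper type):

    import com.google.gson.*;
    import java.lang.reflect.Type;

    // Simplified stand-in for our outermost response wrapper.
    class ApiResponse {
        int code;
        JsonElement body;  // deserialized further once the request type is known
    }

    class ApiResponseDeserializer implements JsonDeserializer<ApiResponse> {
        @Override
        public ApiResponse deserialize(JsonElement json, Type typeOfT,
                                       JsonDeserializationContext context)
                throws JsonParseException {
            // By the time this runs, Gson has already read the entire stream into
            // the JsonElement tree passed in here (the very thing we wanted to avoid).
            JsonObject root = json.getAsJsonObject();
            ApiResponse response = new ApiResponse();
            response.code = root.getAsJsonObject("meta").get("code").getAsInt();
            response.body = root.get("response");
            return response;
        }
    }

    class GsonFactory {
        static Gson create() {
            return new GsonBuilder()
                    .registerTypeAdapter(ApiResponse.class, new ApiResponseDeserializer())
                    .create();
        }
    }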

The problem here is that, despite our thinking we were using Gson’s streaming API, our use of custom deserializers caused whatever JSON stream we were trying to deserialize to be completely read into a JsonElement object tree by Gson so it could be passed to that deserialize method (the very thing we were trying to avoid!). To make matters worse, doing this on our outermost response type that wraps every single response we receive from the server prevents any kind of streaming deserialization from ever happening. It turns out that TypeAdapters and TypeAdapterFactories are now preferred and recommended over JsonDeserializer. Their class definitions look roughly like this:
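
Abridged from the Gson library (convenience methods omitted):

    public abstract class TypeAdapter<T> {
        public abstract void write(JsonWriter out, T value) throws IOException;
        public abstract T read(JsonReader in) throws IOException;
        // ...plus several final convenience methods omitted here
    }

    public interface TypeAdapterFactory {
        <T> TypeAdapter<T> create(Gson gson, TypeToken<T> type);
    }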

Note the JsonReader stream being passed to the read() method as opposed to the JsonElement tree. After being enlightened with this information, we rewrote our custom deserializers as TypeAdapters and TypeAdapterFactories and noticed significant parse-time decreases of up to 50% for large responses. More importantly, the app felt significantly faster. Scroll performance that was previously janky from constant GCs due to memory pressure was noticeably smoother.

Takeaways

  • Use Gson’s streaming APIs, especially in memory-constrained environments like Android. The memory savings for non-trivial JSON strings are significant.
  • Deserializers written using TypeAdapters are generally uglier than those written with JsonDeserializers due to the lower-level nature of working with stream tokens.
  • Deserializers written using TypeAdapters may be less flexible than those written with JsonDeserializers. Imagine you want a type field to determine what an object field deserializes to. With the streaming API, you need to guarantee that type comes down in the response before object.
  • Despite its drawbacks, use TypeAdapters over JsonDeserializers as the Gson docs instruct. The memory savings are usually worth it.
  • But in general, avoid custom deserialization if at all possible, as it adds complexity.

Interested in these kinds of problems? Come join us!

Matthew Michihara

Measuring user perceived latency

At Foursquare, tracking and improving server-side response times is a problem many engineers are familiar with. We collect a myriad of server-side timing metrics in Graphite and have automated alerts if server endpoints respond too slowly. However, one critical metric that can be a bit harder to measure for any mobile application is user perceived latency. How long did the user feel like they waited for the application to start up, or for the next screen to load after they tapped a button? Steve Souders gives a great overview of the perception of latency in this short talk.

For a mobile application like Foursquare, user perceived latency is composed of several factors. In a typical flow, the client makes an HTTPS request to a server, the server generates the response, and the client receives the response, parses it, and then renders it.

Client Timing Diagram

We instrumented Foursquare clients to report basic timing metrics in an effort to understand user perceived latency for the home screen. Periodically, the client batches and reports these measured intervals to a server endpoint, which then logs the data into Kafka. For example, one metric the client reports is the delta between when the client initiated a request and when the first byte of the response is received. Another metric the client reports is simply how long the JSON parsing of the response took. On the server side, we also have Kafka logs of how long the server spent generating a response. By combining client-side measured timings with server-side measured timings using Hive, we are able to sketch a rough timeline of user perceived latency with three key components: network transit, server-side time, and client-side parsing and rendering. Note that there are many additional complexities within these components; however, this simple breakdown can be a useful starting point for further investigation.

The above bar chart shows a composite request timeline that is built using the median timing of each component from a sample of 14k Foursquare iPhone home screen requests. In this example, the user might wait nearly two seconds before the screen is rendered, and most of that was actually due to network and client time rather than server response time. Let’s dive deeper into network and client time.

Network time

The chart below splits out requests in Brazil versus requests in the US.

The state of wireless networks and the latency to our datacenter are major factors in network transit time. In the above comparison, the median Brazil request takes twice as long as one in the US. At Foursquare, all API traffic goes over SSL to protect user data. SSL is generally fine for a connection that has already been opened, but the initial handshake can be quite costly, as it typically requires two additional round-trips on top of a typical HTTP connection. It’s absolutely critical for a client to reuse the same SSL connection between requests, or this penalty will be paid each time. Working with a CDN to provide SSL early termination can also be incredibly beneficial at reducing the cost of your first request (often the most important one, since the user is definitely waiting for it to finish). For most connections, the transmission time is going to dominate, especially on non-LTE networks. To reduce the number of packets sent over the wire, we eliminated unnecessary or duplicated information in the markup and were able to cut our payload by more than 30%. It turns out, however, that reducing the amount of JSON markup also had a big impact on the time spent in the client.

Client time

The amount of time spent processing the request on the client is non-trivial and can vary wildly depending on the hardware. The difference in client time in the US vs. Brazil chart is likely due to the different mix of hardware devices in wide use in each market. For example, if we were to plot the median JSON parsing times across different iPhone hardware, we would see a massive difference from the older iPhone 4 to the latest iPhone 6. Although not as many users are on the older hardware, it’s important to understand just how much impact needless JSON markup can have.

In addition to JSON processing, another important topic for iOS devices is Core Data serialization. In our internal metrics, we found that serializing data into Core Data can be quite time-consuming and is similarly more expensive for older hardware models. In the future, we are looking at ways to avoid unnecessary Core Data access.

A similar variation can be found across Android hardware as well. The chart below shows the median JSON parsing times of various Samsung devices (note that the Android timing is not directly comparable to the iPhone timing, as the Android metric measures parsing of the JSON markup into custom data structures while the iPhone measurement parses straight into simple dictionaries).

In our next engineering blog post, we will discuss some critical fixes that were made in Android JSON parsing.

Conclusion

Measurement is an important first step towards improving user perceived latency. As Amdahl’s law suggests, making improvements to the largest components of user perceived latency will of course have the largest user impact. In our case, measurements pointed us toward taking a closer look at networking considerations and client processing time.

— Ben Lee (@benlee) & Daniel Salinas (@zzorba42)