Moar stats in your Merchant Platform API
Ever since we first launched our Merchant Platform API, a lot of people have been asking if they can get more programmatic access to the data they see in their Merchant Dashboard. Today, we’re doing just that. Our Merchant Platform API now includes all the data from the dashboard:

You can use the venues/stats API endpoint to get the data behind these charts and graphs for any venue you manage over any time range, all in an easy, programmatic fashion.
We’ve also opened up access to daily check-in counts and new visitor counts through the venues/timeseries API endpoint. A similar endpoint is also available for your campaigns, providing not only daily check-in counts, but also daily views and unlock counts for the specials you’re running. These endpoints will soon be available for use with venue groups, too, so you can see aggregated stats across any subset of the venues you manage.
Have fun integrating foursquare stats into an existing dashboard, or even build your own custom one. And if you own a business, be sure to claim your venue on foursquare to gain access to these statistics and more!
- Jason Liszka, software engineer
foursquare’s data and the Explore recommendation engine
On Thursday, 7/28, @benlee gave a talk at the Check-in and Waffles developer event at foursquare SF covering our unique data and the Explore recommendation engine. Take a look:
Foursquare Data and Explore Final
If you love data as much as we do, join us (@injust, @rathboma, @benlee, @maxsklar and @chimeracoder) on foursquare’s data team. We’re hiring!
APIv2: Woulda, coulda, shoulda
As we sunset foursquare APIv1 and announce some amazing new milestones for APIv2, now seemed like as good a time as any to reflect on some of the decisions we made in APIv2 and see how they’re holding up. Fortunately, we were able to draw on both our experience with APIv1 and from tight iteration with our client developers (especially Anoop, who is awesome!). We’re pretty happy with the result, but we still made a few mistakes. Hopefully there are some lessons for anybody else out there who gets stuck designing an API.
JSON
Verdict: no brainer
Flexible. Understood by everybody. (Sorry, Dave Winer!)
OAuth2 and HTTPS-only
Verdict: as good as it can be
OAuth2 is fairly straightforward to implement (although the spec is still churning a bit – the userless flow arrived too late for us to use it!) for both servers and clients, and offloading encryption to HTTPS instead of implementing it ourselves makes it light years easier to get right.
Consumers pushed back a bit on our decision to not let third party applications just use usernames and passwords to authenticate (instead, we force them to use a web-based authorization page). We often tell people, “You shouldn’t be any more comfortable entering your foursquare password into some random app than you are putting your Facebook password into some random app.” Still, the user experience for mobile applications could be better, and on some mobile platforms like Blackberry and Nokia, it’s hard for application developers to interact with the browser in the way that OAuth requires.
The security properties of OAuth2 are far from perfect, but it’s not the protocol’s fault. It would be great if client devices provided a way to keep client secrets secure, even from self-signed SSL certificate man-in-the-middle attacks. And it would be great if the “implicit grant” flow could be more robust against token leaks by the consuming page.
REST lite
Verdict: good!
We decided to use resourceful URLs for certain key objects, like users/USER_ID or venues/VENUE_ID, with actions and connections hanging off of them. But not every resource has a URL (e.g. badge unlocks, todos, comments). This keeps the API minimal and reduces the surface area we need to keep backwards compatible.
On the flip side, we decided not to have deeply nested URLs, e.g. users/USER_ID/tips/TIP_ID, instead opting for flat tips/TIP_ID. This led to much simpler URLs. And having only one path for, say, the details about a tip, let us avoid making developers choose between multiple ways to do the same thing. As a bonus, this explicit, DRY approach to API design makes monitoring and debugging much easier.
Finally, we avoided using the extended set of HTTP verbs beyond the web’s idiomatic POST and GET, in favor of putting the action into the endpoint name, such as tips/add or tips/TIP_ID/markdone. Developers may be using stacks where these verbs are impossible or confusing, and if we supported a workaround for a lack of verb support, we’d be back at having two ways of doing things.
Generic structures and indirection
Verdict: good!
In APIv2, lists use a general structure of { count: X, items: [...] }. This lets us vary the amount of data we want to include in a given context (e.g. the number of tips in a venue response) without breaking clients. And it makes it easy for developers to understand how much data there may be to page in and to share list-handling code across endpoints.
Grouped lists extend this paradigm, where groups are represented via data, not structure. Rather than tips: { friends: [...], others: [...] }, we do tips: { count: X, groups: [ { name: friends, count: Y, items: [...]},.... ] }. This is more wordy, but it allows us to add and remove groups without breaking clients, especially in cases where the data is being pulled into objects where field names map to members. It also allows us to specify an order of groups and to structurally group sub-counts with sample items (e.g. the count of tips from my friends is next to tips from my friends).
In general, we tried to live by not using keys as data. Although it might be intuitive to say something like notifications: { mayor: { .. }, points: { .. } }, this structure suffers from the same problem as the groups above: it’s hard to add and remove groups without breaking consumers, and there’s no order. Instead, we opted for notifications: [ { type: “mayor”, .. }, .. ], which is much more flexible.
Finally, where possible, we use indirection to guard against the introduction of additional data. For example, rather than notification: { type: “special”, item: { ...specialdata... } }, we opted for notification: { type: “special”, item: { special: { ...specialdata... } } }. Although wordier, this format is much more explicit, and it allows us to add additional objects related to a given notification without perverting the form of a “special” object.
Documentation
Verdict: good!
Never underestimate the value of documentation; it’s been one of the largest sources of positive feedback, especially combined with the API explorer, which is built using standard OAuth2. To make it easier to write documentation given our itty bitty team, we hacked together a really simple documentation generation system in python that takes text files and spits out Lift templates, and it’s part of our main code repository. No engineer developing new endpoints has an excuse not to document them, and the documentation is the go-to source of information even for internal developers.
Timestamps as seconds since epoch
Verdict: good!
Not human-readable, but so easy to parse, and nobody has complained.
CamelCase
Verdict: meh
They look okay on field names, but they still look a little funny on parameter names, don’t they? It’s also a little lame that our endpoint names still aren’t camel case, as well as certain, uh, special field names.
Images
Verdict: meh
We had a lot of debates about representing images and surveyed a lot of other APIs. Sadly, we still ended up with slightly different treatments of representing the available sizes and filenames for photos, category icons, badge images, and point icons, depending on the number of sizes we anticipate offering, squareness of images, and our own evolving sense of what is easy to use. In general, the badge response seems the least wordy and most user-friendly way of representing a range of image sizes, and we’re using it where we can throughout the API.
Versioning
Verdict: jury’s still out
The hardest part of designing an API is that you’re stuck with a bunch of decisions you can’t take back easily. But in some cases we needed to take them back anyway, and, for those we came up with the idea of clients sending back a version like 20110708 and promising to be compatible with all changes up to the date. This can lead to some tricky coordination problems between internal test clients and new server releases, and it places a large burden on clients to avoid regressing, but it’s also very simple to explain, and we’re trying to keep the number of breaking changes really, really low. We do wish we’d gone out the gate with this versioning plan, since now we’re forced to break the “default” behavior at some point.
Category representation
Verdict: oops!
Some things require testing with a range of clients to really show their warts, and unfortunately our category representation was exercised too late. It quickly became clear that the way we represent categories in venue JSON is an awkward compromise between brevity and utility by excluding the category ids and icons of parents. For any application where you may want to roll up the category hierarchy, you’re stuck loading the whole hierarchy just to contextualize a small set of results. In the users/USER_ID/badges response, we tried out a sort of “foreign key” approach where part of the response contains IDs pointing back into a dictionary of details that is part of the same response. In hindsight, we should have considered using something like this in more places. Although less straightforward than a totally-inline API like Twitter’s, it still avoids the extra call overhead of a normalized API like Facebook’s.
Object consistency and level of detail
Verdict: meh
This is the more general case of the category problem. It’s always hard to decide how much detail to include in an object, and we largely let the requirements of our own clients dictate what to include where. In APIv1, it felt arbitrary which fields would be present in, say, a user, when seen in a friends list versus in a list of checkins. In APIv2, we tried to be more principled by describing “compact” and “full” JSON representations with a minimal number of optional fields, most of which were optional because of context or absence of data. This sometimes meant sending down extra data in a given context, but we felt it made it easier to consume on the client, especially if the client wanted to cache data.
However, we’re starting to hit the limits of this. Do we add tipCount to compact venues? Does that adversely impact server performance or client speed? Do users always include their twitter contact information, if visible? We’ve discussed adding a detail=high parameter, or something like Facebook’s fields=X,Y,Z parameter, but we’re not excited about the complexity that results.
Envelope
Verdict: good!
Our decision to wrap all responses in { meta: { … }, response: {...} } has been well received, as well as the error details that we pack into the meta block. The decision to only provide error messages intended for developers, plus an error type for developers to switch on, seems to be working relatively well.
Grouping versus ranking
Verdict: meh
Although grouping tips by self, friends, and others seemed reasonable given our UI treatment, this doesn’t really make sense in the long term if we start to more aggressively rank tips from your friends against other popular tips. In fact, we already threw out the grouping of venue search results.
Halllllllp
These are just some of the decisions that we made in building APIv2. Is our assessment correct? Are there other things you’re curious about? Hit us up in the comments.
Enjoy trying to elegantly trade off performance and ease of use in APIs? We’re hiring.
- Kushal Dave, foursquare engineer
Practical Data Storage: MongoDB at foursquare
Our head of server engineering, Harry Heymann, recently spoke at MongoNYC on “Practical Data Storage: MongoDB at foursquare.” Watch the presentation here, or, if you’d prefer, you can read through the slides.
Fun with MongoDB replica sets
Though we’ve already established how much we like MongoDB, we felt it appropriate to keep the love train going because of another great feature: MongoDB replica sets and how we’re using them at foursquare.
Replica sets are super useful because they provide automated replication and failover, even in sharded clusters. Previously our clusters were running master/slave replication, which worked well but required us to do a lot of manual work if we needed to fail over to a slave. It also required us to hack in our own solutions to make the slaves available for read-only queries. Replica sets directly addresses both of these problems.
Replica sets have been available since MongoDB 1.6, though some more interesting refinements were rolled out in 1.8, which made our transition to replica sets go a little more smoothly. What’s nice about them is their simplicity. As you’ll see, setting up automated failover for a sharded database cluster is as simple as starting up mongo server instances and entering some commands in the mongo shell. Even the migration path from master/slave replication is pretty straightforward.
We won’t go too much into how replica sets work. Instead, we’ll describe how we set up a few of our replica set configurations and the motivations behind them.
1. Straight-up replica set with arbiter node
We have a couple single-master/single-slave clusters that we converted to a replica set with a primary and secondary node plus an arbiter node. This is probably the least complex configuration for a replica set in our database setup. To do the conversion, we updated all the mongod configuration files with the following options:
fastsync = true
rest = true
The fastsync option allows nodes to come online quickly if they have data that is relatively up to date with the primary node (i.e. the oplog is large enough on the secondary node to have buffered writes from the primary while being restarted). The rest option enables a really cool admin UI for diagnostic purposes, which is useful if you’re running an external monitoring tool like Ganglia or Nagios.
We also added an ec2 instance dedicated to running mongods as arbiters. These mongod processes are lightweight and only participate in an election process. Though running a single ec2 instance for this purpose introduces a single point of failure, we felt that having to maintain just one node was a reasonable trade off for our less critical clusters.
After restarting the master mongod with the new configuration options, we executed the following commands in the mongo shell:
> rs.initiate() // initialize with a fresh replica set config
{
"info" : "Config now saved locally. Should come online in about a minute.",
"ok" : 1
}
auxdb:PRIMARY> rs.add("auxdb-1:27017") // add another node
{ "ok" : 1 }
auxdb:PRIMARY> rs.addArb("auxdb-arbiter:27017") // add the arbiter node
{ "ok" : 1 }
At this point, we update our app code to pass the host/port list of the replica set members to the MongoDB driver to configure the connection to the replica set. Note we specify port 27017 in the config; MongoDB assumes that port by default, but we specify it here for clarity. The driver automatically figures out which node in the set is the primary node should a failure occur. Magic!
2. Replica set with dedicated read slaves and single backup node
We have a few clusters that are particularly read-heavy. Replica sets give us read-scaling with the "slaveOk" option, which we can set on a per-connection basis with "mongo.slaveOk()". Secondary nodes receive read queries, but all writes still go to the primary node. In addition, we keep a single backup node and prevent it from receiving any read queries by configuring it as a hidden node. We do this so that new nodes can perform an initialSync from it without impacting production traffic.
Again, we update the configuration files, restart the mongods, and connect to the cluster with the mongo shell:
> rs.initiate()
{
"info" : "Config now saved locally. Should come online in about a minute.",
"ok" : 1
}
cluster2:PRIMARY> rs.add("db-1:27017")
{ "ok" : 1 }
cluster2:PRIMARY> rs.add("db-2:27017")
{ "ok" : 1 }
cluster2:PRIMARY> rs.add("db-3:27017")
{ "ok" : 1 }
cluster2:PRIMARY> rs.add({ "_id" : 4, "host" : "db-backup:27017", "priority" : 0, "hidden" : true })
{ "ok" : 1 }
db-backup:27017 is like a regular secondary node that receives updates from the primary node, but because it’s hidden it doesn’t receive queries and is ineligible to become primary should an election be triggered. New nodes that we bring online are configured like this:
3. Shard on top of replica set with dedicated read slaves
This is like the previously described setup except we back a shard with a replica set, and mongos is configured with the replica set host/port information. This setup combines some of MongoDB’s most powerful features in a fashion similar to how drives are organized in a RAID 1+0 array. We find that this configuration simplifies our operational management for our larger clusters and allows us to scale out horizontally with ease.
We had already created a sharded database configuration and set up mongos and mongod config servers. In addition, each shard was already running a master/slave setup. To convert these clusters to use replica sets, we just went through the steps in number 2 above for each node in the shard and reconfigured the mongos instance:
> db.shards.update({ _id : "shard0000" }, {"$set" : { "host" : "sharddb/shard-0-a:27017,shard-1-a:27017,shard-2-a:27017,shard-3-a:27017,shard-backup-a:27017"} })
One nice thing about this is our app driver config doesn’t need to change. mongos is smart enough to not only determine which node in the set is the primary node, but also find other nodes in the set that weren’t added to the shard configuration initially. We also set slaveOk() in our connection setup and fan out our reads across the nodes in the shard.
That’s it!
If this all looks simple, it’s because it is. We’ve been very pleased with how replica sets have smoothed out the bumps in handling operational emergencies as well as for general maintenance and scaling out. Our next step is to configure our replica sets for data center awareness, which is not fully supported but workable with the latest version of MongoDB. Yay Mongo!
- Leo Kim, foursquare engineer
Building a recommendation engine, foursquare style
Last summer, foursquare’s employee count had grown a bit beyond our office capacity (as we surged towards 20 employees) and we had people sitting in whatever open space we could find. We were split between floors, parked on folding tables, and crammed into couches and loveseats. In one of those seats, @anoopr was playing around with building a map showing interesting places, which we called “Explore.” In the ensuing eight months, we went through several iterations and evolutions to arrive at the recommendations engine we launched two weeks ago as part of the foursquare 3.0 update. This blog post describes the process we used to develop this system, and some interesting tidbits we found along the way.
After that initial discussion, we quickly set up an API endpoint for Explore and started adding and tweaking features. On the server side, there was a web-based test view directly hitting the API with a list of places and a map with pins. We generalized the structure coming through, and started throwing more and more interesting data at it. We started with “popular” – which, as you will see in a moment, is not the easiest thing to define – and tried to put in some time sensitivity. We iterated, quickly. We deployed new functionality multiple times a day and had meetings with the product team every few days to tweak our trajectory and revisit various concepts.
With the results we were seeing, we could already sense that Explore was going to become something awesome. To get some more live feedback, we built a mobile-web version that sensed location and had touch controls. By using mobile-web, we could update the code instantly and tweak often without having users install a new app. Foursquare employees started carrying it around with them and testing out searches to find lunch and dinner.
Our mobile web test client
At this point, it was time to build in some personalization into the algorithm. Collaborative filtering is a common way of personalizing content based on your check-in history or the check-ins of “people like you.” We used both, limiting to just people inside your friend graph for the latter. By then we had approximately 10M venues, which means that if we want to compute the similarity between every pair of venues, that is 100 trillion computations (which is a lot!). Luckily by this time we had a more sophisticated grasp on internal analytics (see @rathboma’s analytics blog post), so we had a Hadoop cluster setup as part of our Hive deployment. On top of this we layered a package called Mahout, which was relatively easy to modify in order to compute custom similarity scores. By setting some constraints on which scores were significant, it was possible to build the resulting similarity matrix in less than an hour on a 40-machine cluster. By extrapolating how long it would take with a straightforward score calculator I wrote in Scala, computing the full matrix would have measured into days, or even weeks.
One of the hardest parts of building this was determining what the algorithm should do. For example, take the “cold start” – the case where you don’t have enough data for a user to provide personalized results. If you want to order these to optimize the average rank in the list returned, you could simply sort by pure check-in count or a moving average of the same. With that method, you get some interesting places, but you don’t find some of the hidden gems like Alidoro or Curry-Ya, foursquare staff favorites. Restaurants with larger capacity and longer hours tend to outweigh others, despite the fact that they may be more interesting to the average user. Also, ranking against the user-based feedback (star ratings, for example) from external sources, runs into other problems, like relative differences between rankings. For example, someone could rank both Per Se and Shake Shack five stars, but are they really equal in weight? Something needed to be done to account for all of these deficiencies.
While we’re keeping the new “cold start” algorithm as part of our secret sauce, we wanted to give you a closer look into the data that fed the ranking. To get you started, see below for a graphic of 50 randomly sampled, highly-frequented venues from New York state plotted on two axes: number of unique users and average visits per user. You can think of the distance from the origin as the overall popularity. Notice how categories of venues tend to fall in different regions. As one would expect, offices are visited often by a relatively small number of users, and most restaurants are frequented by many people but tend to have a lower average. However, once you zoom into a specific type of venue, you can see the difference between “must see” places like Juniors and places like Meze Grill, a spot at which people tend to be repeat customers.
Although a big component of building a recommendation system is the math, another major component is telling users why they should go to a place. Here at foursquare, we believe very strongly in social encouragement; we tell you which of your friends went to a place and which places you have gone that people also visit. We also believe in the power of loyalty, which means we tell you how many times you have been to a place before. These reasons not only influence the ranking decisions we make but also supplement the experience, maybe sparking a conversation with your friend or remembering a place you haven’t been in a while.
Reasons backing up recommendations
The algorithms were a challenge to build and required significant research into both common and esoteric statistical methods. This was trumped only by the massive engineering effort we put in place to power this system. Aside from the pre-computed similarities, all of the retrieval, ranking, and rendering is done online with an average server response time of less than 100ms. Generally, our largest latencies come from database queries; here are a few that were especially tricky:
- Querying an initial corpus to rank – We retrieve a set of venues within the radius you specify as candidate recommendations. For performance reasons, we pull back a fixed number of entries (pro tip: this means that the smaller the radius you set, the more venues we look at per unit area, potentially giving you better results). Given these constraints, we need to have a way of querying our venue database for the top N values sorted by some criteria within a specific geocircle. Most databases simply cannot handle this (at our scale), as it first requires the ability to index geographically and then further requires a way to compound the geo index. Even scanning each record within the circle is prohibitively expensive. There aren’t many databases in the world that can do this, but after some tuning and custom logic, MongoDB is able to do it in less than 30ms for a 10mi radius in New York City (hundreds of thousands of venues) using some magic from the folks over at 10gen.
- Where have your friends been? – Pulling back the history of your friends’ check-ins at all of the candidate venues is a potentially gargantuan task. Imagine you have 50 friends and an initial corpus of 250 venues. If each friend has checked in twice, that’s a potential of 25,000 check-ins we could pull back. At our scale and query volume, this is impossible to process. To get around this, we built a special cache on top of our check-ins that we call the “intersection server” which essentially stores light weight aggregate data about the interaction between users and venues, including check-in count. Now we have an aggregate form, half as many potential records, and lighter weight documents, which are manageable to pull back given our performance constraints.
What’s next? We will continue to make improvements to the engine, open up to more platforms, add new features to drive the algorithm, gather feedback, and generally make it more awesome. We’ve gotten great feedback so far and are glad Explore is the tool so many of you already use to explore your city!
Justin Moore (@injust)
P.S. If you love data as much as we do and want to bring Explore to the next level, foursquare is hiring!
Objective-C Blocks in iOS 4.0
The foursquare 3.0 application for iOS is a major release for us, and marks the maturation of several major technical efforts. With so many new features in the works, we decided that it was time to make the leap to iOS 4.0 in order to better take advantage of the platform. Since less than 5% of our iPhone API calls are coming from iOS 3.x devices, we felt that the move would not be too painful for our users. Even so, increasing platform requirements warranted serious consideration.
The single most compelling reason for the upgrade is the addition of a new language feature called blocks. Known more traditionally as closures (almost synonymous with inner/anonymous/lambda functions), blocks allow the programmer to succinctly define an operation that encapsulates both actions and data. Closures are by no means a new concept, and have played a key role in the history of functional languages. Despite this, blocks were only added to Objective-C with the release of Mac OS X 10.6 in 2009.
Since Objective-C is an object-oriented language, it already supports a standard mechanism for encapsulating functions and data: the class. Anywhere that a programmer might use a closure, a class instance could suffice. For example, the standard callback pattern in Cocoa takes a function pointer, and a second context pointer of type id or void, through which any relevant data is passed. The new pattern takes a single block argument, which serves as both callback function and contextual data. Many debates have weighed the merits of objects versus closures; the bottom line is that these constructs are similar tools, originating from different traditions of language design.
One of the most thought-provoking comments I ever heard at WWDC came from a seriously bearded Apple compiler guru, during the question and answer session of a compiler talk: “We finally have closures in C!” The presentation had already made clear that the implementation of blocks in GCC was no simple task, but that comment (and particularly his word choice) made me think hard about why blocks are so valuable.
One reason is brevity. Object-oriented programming can be very verbose, and blocks allow programmers to define small pieces of logic flexibly, inline, and in a consistent manner. The growing popularity of dynamic scripting languages has demonstrated that the concise, flexible nature of functional programming need not be reserved for ‘academic’ languages. But why does brevity matter? Is it simply a stylistic concern?
Objective-C is a verbose language. Method names tend to be long but easy to read, because arguments are essentially labeled in every call. Less helpful is the boilerplate @ syntax required to define Objective-C classes. As a Cocoa programmer, I find myself more reluctant to write small, organized classes than when I work in C++ or Python. Blocks elegantly address the common case where we need to pass some structured data along with a callback, but don’t really need the formal class definition because we are only using the callback in one place. With blocks, the code becomes simpler, consolidated, and easier to update.
Take for example this simple iOS 3.x (pre-blocks) animation code from a view controller:
- (void)animateFadeOut {
[UIView beginAnimations:@"fadeOut" context:nil];
[UIView setAnimationDelegate:self];
[UIView setAnimationDidStopSelector:@selector(fadeOutDidStop:finished:context:)];
[UIView setAnimationDuration:0.3];
self.loadingView.alpha = 0.0;
[UIView commitAnimations];
}
- (void)fadeOutDidStop:(NSString *)animationID finished:(NSNumber *)finished context:(void *)context {
[self.loadingView stopAnimating];
[self.loadingView removeFromSuperview];
}
With blocks, the preferred pattern becomes:
- (void)animateFadeOut {
[UIView animateWithDuration:0.3
animations:^{ self.loadingView.alpha = 0.0; }
completion:^(BOOL finished) {
[self.loadingView stopAnimating];
[self.loadingView removeFromSuperview];
}];
}
The callback disappears, and the entire operation now appears in one line. Notice also how the new method does away with the need for a number of calls to UIView. Fairly trivial, but now imagine that the callback needs to perform a more complex action. For example, suppose we have several objects loading, and on completion of any given object, want to perform two actions: remove the appropriate loading graphic, and enable a button.
In 3.x, we would need to pass a context containing pointers to the relevant views, contained in either a dictionary or some other container instance. Suddenly, focus has shifted to defining and constructing the context. In 4.0, on the other hand, we simply reference the objects in question inside the block, thereby defining the ‘context’:
- (void)animateFadeOutWithLoadingView:(UIView *)currentLoadingView button:(UIButton *)currentButton {
[UIView animateWithDuration:0.3
animations:^{ self.loadingView.alpha = 0.0; }
completion:^(BOOL finished) {
[currentLoadingView stopAnimating];
[currentLoadingView removeFromSuperview];
currentButton.enabled = YES;
}];
}
The beauty of this example is that we need not worry about any retain/release semantics that might complicate the lifetime of a 3.x-style context. Instead, the block ‘closes over’ all the variables that it references, and the run-time retains them as necessary. When the application completes the animation, it disposes of the block, and the references it contains. This ‘lexical scoping’ is what makes closures so useful.
For problems with more complex memory semantics, the clarity that blocks offer becomes a substantial benefit. The feature was announced in conjunction with Grand Central Dispatch, Apple’s general purpose concurrency framework, and it is in this realm that blocks really shine. Writing safe, asynchronous code is difficult, in part because tracking the state of each context in flight can become a major bookkeeping effort. The simplicity and ad-hoc encapsulation that blocks provide will aid tremendously as we move into the era of multicore mobile computing.
In addition to blocks, iOS 4.0 offers several other features that improve the brevity of code. The redundancy of instance variables and properties is fading away, and class extensions can now declare private members. Regular expressions, long absent from Cocoa, have finally surfaced, replacing NSScanner for many tasks. The trend towards more dynamic patterns continues, and I can’t help but wonder how long it will be before garbage collection and MacRuby make the leap from the Macbook to the iPhone. In any case, we hope that you enjoy the new release, and look forward to improving the foursquare experience even more on iOS 4!
- George King, iOS Engineer
@georgewking
How we found the rudest cities in the world – Analytics @ foursquare
With over 400 million check-ins in the last year, it’s safe to say that our servers log a lot of data. We use that data to do a lot of interesting analysis, from finding the most popular local bars in any city, to recommending people you might know, and even for drawing pretty pictures. However, until recently, our data was only stored in production databases and log files. Most of the time this was fine, but whenever someone non-technical wanted to do some data exploration, it required them knowing scala and being able to query against production databases.
This has become a larger problem as of late, as many of our business development managers, venue specialists, and upper management eggheads need access to the data in order to inform some important decisions. For example, which venues are fakes or duplicates (so we can delete them), what areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?
In short, without easy access to this data, we are not able to make smart decisions at any level of the company.
Thus we needed two things:
- A set of high-level scheduled reports to inform general business decisions.
- A way for anyone in the company to do data-exploration without hurting our production systems or learning about scala, sbt, ssh, and mongo.
The Solution
We decided to use Apache Hadoop, and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2.
For those who don’t know, Hadoop is an open-source Map-Reduce framework for parallel data processing, and Hive is a secondary service that allows you to interact with Hadoop by defining ‘virtual’ tables and using familiar SQL syntax.
The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the ruby Thrift client.
We all like pictures, so here is a diagram:
The idea is simple: we run our own ‘data server’ to act as a gateway to reports. This allows us to:
- Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
- Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
- Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
- Add new data in a simple way (just put it in Amazon S3!).
- Analyse data from several data sources (mongodb, postgres, log-files).
Importing Data
The last two points are very important. In fact a large portion of the data-server’s code base is dedicated to data cleaning and importing. We found it best to represent all data in tab-delimited flat files. To turn mongo/log/postgres/json data into this format, each ‘table’ has a specification written in ruby. Here is a simple example:
include Foursquare::LocationLookup
mapped_attributes :id, :venue, :shout, :lat, :long
mapped_attributes :country, :state, :timezone
end
id: ‘123’,
venue: ‘456’,
shout: ‘ayup mum!’,
ll:’24.5,-50.4’,
something_else: ‘boo!’
}”
checkin = Checkin.new(data)
So now:
=> 123 456 ayup mum! 24.5 -50.4 us New York America/New_York
The initialize method provided by Foursquare::MappedClass can interpret several data types, in this example JSON is used. By including the LocationLookup module, country, state, and timezone can be automatically added if a lat/long field exists (using a local Mongodb database). For all such transformations, tabs, newlines, and excess white-space are removed from field values to ensure that each record occupies only a single line.
We have rake tasks to run this as either a simple script, or as part of a hadoop streaming job.
Running Queries
Because we’re storing data away from the production system, we can run queries that generate 1,000,000,000,000 records if they want to (I’m looking at you @injust), and the system simply emails them a link to their results when the query has finished (so they don’t have to wait around). In fact we can run all sorts of cool stats.
A Fun Example
Lets say we want to find the city with the rudest citizens, judged by how often a tip left in that city contains a curse word. We could run this query:
v.city,
v.state,
sum(curse) AS curses,
sum(any) AS any_tip,
sum(curse)/sum(any) AS percentage
FROM
(
SELECT
venueid,
IF(text LIKE ‘%curseword_here%’, 1, 0) AS curse,
1 AS any
FROM tips
) tips
JOIN venues v ON tips.venueid = v.id
GROUP BY v.city, v.state
SORT BY percentage DESC
After 5 minutes of waiting, we have a list of top 20 offenders (highest % of tips containing curse words):

(I’ve filtered out cities that had less than 1000 tips total.)
Its good to see that the Mancunians truly are not only the rudest people in the UK, but the rudest people globally, only El Paso comes close. Although please keep in mind that this only evaluates the rudeness of English speaking countries (like that would make a difference?).
In Summary
Amazon’s Elastic MapReduce plus a simple Ruby on Rails server can make a powerful (and cheap) data analysis tool. By reducing the barrier to data-exploration we have been able to inform better business decisions, and even create a little fun.
- Matthew Rathbone, Foursquare Engineer (and a proud British midlander)
MongoDB strategies for the disk-averse
Behind the scenes at foursquare, we have a lot of data collection efforts that present interesting scaling puzzles. One is the venue metrics system, which allows business owners to get information about checkins to their venue over time. It lets them see the effect of specials, understand their clientele’s demographics, and even identify their most loyal customers.
To store this data, we need to handle tens of writes per second across millions of venues, interleaved with infrequent reads of the last 90 days of data for a given venue. Ideally, reads would return within a second or two. This isn’t inherently hard, but we really want to minimize the resources dedicated to this small corner of our system.
We’re fans of MongoDB, and one natural way to store this information would be to have one document for every active (venue, hour) pair which contains various counters (men, women, etc.). We would take each new checkin and increment appropriate counters in the (venue, hour) record, creating the record if it doesn’t already exist. But because new records are just appended to the end of a memory-mapped file, every record about a venue would end up being on a different page amid records about other venues that just happened to be created at the same time.
If we hold all of the data live in RAM, this is no big deal. But as soon as we need to read from disk, every non-contiguous record translates into a page fault, which can be destructive to performance. To load 3 months of data, we’d need to fetch 2,000 hourly records, which would take tens of seconds if every record were on a different page and incurred page fault and disk seek overhead. (On our XXL EC2 machine with 4 EBS volumes in RAID0 on ext4, we seem to max out at 50-ish page faults per second.)
Locality, locality, locality
A lot of other database systems provide ways to establish disk locality. Oracle offers Index Organized Tables, MySQL has InnoDB, Tokyo Cabinet has its B-Tree store, and Bigtable stores data in range order. Unfortunately, these all come at the expense of write performance as data is rearranged to make room for new records (with the exception of Bigtable, which uses compactions to mitigate this effect).
MongoDB offers two options that address this, but neither is quite right for our application.
- MongoDB’s collections feature lets us encourage some amount of disk locality. Every page in a database file is allocated to only one collection, so we would have the minimum possible page faults to load our data. They also reduce index size by pulling some information out of it. But although the number of collections is configurable to be higher than the default of 24,000, there’s still a hard limit well below millions.
- MongoDB lets us grow an existing document, so that, for example, a given venue’s visits in a given month could be a single array that we keep appending to. Unfortunately, if the record exceeds available padding, the entire record needs to be rewritten to a new location, leading to some very expensive writes and fragmentation on disk. MongoDB determines how much padding to allocate by watching how records grow in a given collection, but if some records grow a lot and some not at all, this is hard to optimize. We’re experimenting with growing documents for check-in comments, but they simply did not provide the insertion rate we needed when trying to backfill existing stats data.
In the end, we settled on building locality at the application level. Every time we need to record the data for a given venue at a given hour, we align it to a five-hour period. We insert a record with a 0’ed out 5-element array (which will fail if that record already exists) and then update into the appropriate position. Roughly:
// It represents the 5 hours starting at TIMESTAMP
db.timeSeries.insert({
_id: {v:VENUE_ID, t: TIMESTAMP} vals: [0,0,0,0,0]})
db.timeSeries.update({
_id: {v: VENUE_ID, t: TIMESTAMP}, { $inc: { ‘vals.1’: 1} })
They to query a range of this data:
'$gte': _id: {v: VENUE_ID, t: TIMESTAMP_1},
'$lte': _id: {v: VENUE_ID, t: TIMESTAMP_2} }})
Grouping every 5 hours of data adds some complexity to our application, but it takes a fifth as many disk seeks as storing each hour separately. How did we pick 5 hour chunks? Based on trial and error, this seemed to be a sweet spot given the sparseness of visits to venues, especially given that arrays in BSON are implemented as dictionaries and not quite as compact as traditional arrays.
Serialization and its disk-contents
There are a few other cool tricks we play in other parts of this data store:
- Compound keys. As hinted above, MongoDB allows _id’s to be objects. This allows us to save on the number of indices we need. For example, to record the number of unqiue visitors in a given hour, we have a collection where all of the data lives in the _id, which is of the form { v: VENUE_ID, t: HOUR_SINCE_EPOCH, u: USER_ID }. The default index on _id already allows us to find all records matching a venue ID or a venue ID in a given time range. We have to be careful to use ordered dictionaries, provided by the SON libraries in PyMongo (equivalent features exist in other languages), not regular dictionaries. The order has to be consistent every time we insert or update and reflect the way in which we intend to query.
- Covering indices. As of MongoDB 1.7.3, queries that can be answered completely by the index do not hit disk at all to find the document, assuming the index fits in RAM. In the case of the hours-uniques collection described above, the documents are empty and never loaded on queries.
- Small keys. An oldie-but-goodie: we use small key names. Every key is repeated in every record and can be a significant part of record size if we’re just storing numbers. We keep a mapping from single letters to human-readable names in our application code.
- Use the right type. Similar to small keys, storing numbers as numbers and object IDs as object IDs can produce significantly smaller data than playing tricks with strings.
We also toyed with splitting our data into separate databases by month. This would let us roll off old months of data easily, reducing both active index size and data size on disk. Capped collections could also achieve the same effect. Neither solution felt urgent, and we’re doing just fine keeping hundreds of gigabytes of data live with only 64GB of RAM.
In doing all of this, mongostat and iostat were our friends. Among other things, they helped us realize that covering indices weren’t implemented yet in Mongo 1.6 and that the last point release of EC2’s official Red Hat AMI actually could not address more than 32 GB of physical RAM.
Disk-saster averted!
One of the best things about MongoDB is that it’s very easy to reason about, making it easy to define performant schemas, rather than fiddling endlessly with confusing tuning parameters. Amazon’s EBS has abysmal I/O performance, even after RAID0 and ext4. But by understanding a little bit more about how MongoDB pages data off of disk, it’s still possible to squeeze out something reasonable.
- Kushal Dave, foursquare engineer




2 Comments