A Hackday Project: What neighborhood is the ‘East Village’ of San Francisco?
Have you ever wondered what the equivalent of your neighborhood is in another city? How would you find the Times Square of Tokyo? The Beverly Hills of Dallas? Or the East Village of San Francisco? For a hackday project this January, we mapped our 1,500,000,000 check-ins to 140,000 neighborhoods all over the world to better understand and compare the different places we live, work, and play. Here is a brief account of our hack.
First, to collect data about neighborhoods, we built some Hive queries to access our large collection of check-ins (stored in S3) and count the number of check-ins per category for every neighborhood in the world. For example, the East Village of New York has 230k check-ins at bars, 57k check-ins at pizza places, 18k check-ins at yoga studios, and 34k check-ins at karaoke places (hipsters like to sing!).
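The counting step is a simple group-by. Here is a minimal Python sketch of the same aggregation on a handful of in-memory records; the real job ran as Hive queries over data in S3, and the (neighborhood, category) record shape and the sample rows below are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def count_checkins(checkins):
    """Count check-ins per category for each neighborhood.

    `checkins` is an iterable of (neighborhood, category) pairs. This record
    shape is an assumption for illustration, not the real check-in schema.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in checkins:
        counts[neighborhood][category] += 1
    return counts


# Toy records, made up for illustration only.
sample = [
    ("East Village", "bar"),
    ("East Village", "bar"),
    ("East Village", "pizza place"),
    ("Soho", "clothing store"),
]
counts = count_checkins(sample)
```

Each neighborhood ends up with one counter keyed by category, which is the raw material for the vectors described next.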
We then used MATLAB to represent each neighborhood as a 400-dimensional vector, one dimension per category, which specifies the normalized probability distribution of checking in to a place in each category relative to the baseline distribution of the city. This approach allows us to compare neighborhoods with each other using a measure such as cosine similarity or KL divergence.
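The exact normalization is not spelled out here, so the following is only one plausible reading, sketched in Python rather than the MATLAB used for the hack: turn counts into a smoothed probability distribution, divide each category's probability by the city-wide probability, and compare the resulting vectors. The function names, the additive smoothing, and the toy counts are all assumptions for illustration.

```python
import math


def distribution(counts, categories, smoothing=1.0):
    """Turn raw counts into a probability distribution over `categories`.

    Additive smoothing (an assumption of this sketch) keeps every
    probability positive so the ratios and logs below stay defined.
    """
    total = sum(counts.get(c, 0) + smoothing for c in categories)
    return [(counts.get(c, 0) + smoothing) / total for c in categories]


def relative_to_baseline(neighborhood_dist, city_dist):
    """Ratio of neighborhood probability to city probability per category."""
    return [p / q for p, q in zip(neighborhood_dist, city_dist)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


# Toy counts, made up for illustration only.
categories = ["bar", "office", "pizza place"]
city = distribution({"bar": 50, "office": 40, "pizza place": 10}, categories)
ev_dist = distribution({"bar": 30, "pizza place": 8}, categories)
soho_dist = distribution({"office": 25, "bar": 5}, categories)

ev_vec = relative_to_baseline(ev_dist, city)
soho_vec = relative_to_baseline(soho_dist, city)

similarity = cosine_similarity(ev_vec, soho_vec)
divergence = kl_divergence(ev_dist, soho_dist)
```

Note that KL divergence is asymmetric and measures difference rather than similarity, so lower values mean the two neighborhoods are more alike, whereas higher cosine similarity means the same thing.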
Here is a visualization of the similarity matrix for NYC neighborhoods. Blue entries indicate pairs of neighborhoods that are very similar, and red entries indicate pairs that are most different. The ordering is determined by a k-means clustering on the data, so similar neighborhoods are placed close to each other. Looking along the diagonal of this matrix, we see groups of places that are very similar to each other, such as the East Village, the Lower East Side, and Alphabet City.
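To make the ordering step concrete, here is a small Python sketch of clustering neighborhood vectors with k-means and then listing neighborhoods cluster by cluster, so that rows and columns of the similarity matrix from the same cluster sit next to each other. This is a plain Lloyd's-algorithm sketch with a simplified deterministic seeding, not the code used for the hack; the two-dimensional vectors are toy values made up for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_labels(vectors, k, iterations=20):
    """Plain Lloyd's k-means.

    Centroids are seeded with the first k vectors, a simplification that
    keeps this sketch deterministic; real code would seed more carefully.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_order(names, vectors, k):
    """Order names so members of the same cluster are adjacent."""
    labels = kmeans_labels(vectors, k)
    return [name for _, name in sorted(zip(labels, names))]


# Toy two-category vectors, made up for illustration only.
names = ["East Village", "Soho", "Lower East Side", "Midtown"]
vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
order = cluster_order(names, vectors, k=2)
```

With these toy values the two nightlife-heavy neighborhoods land in one cluster and the two office-heavy ones in the other, which is what produces the block structure along the diagonal.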
It turns out that a good proxy for describing a neighborhood is the proportion of activities that go on inside it. For example, imagine two neighborhoods that both have lots of check-ins at apartments, colleges, and food trucks (think college towns). Those two neighborhoods are more similar to each other than either is to a neighborhood with tons of check-ins at offices and retail stores.
Here are the top categories for a few neighborhoods:
Soho, New York: clothing stores, offices, electronics stores, coffee shops, French restaurants
Mission, San Francisco: Mexican restaurants, bars, coffee shops, burrito places
Kendall Square, Boston: offices, food trucks, tech startups, college academic buildings, sandwich places
Hollywood, LA: nightclubs, multiplexes, burger joints, hotels, bars
At this point in the hack day, we shared this data with the rest of the company so people could explore their own neighborhoods. Foursquare HQ had just moved from the East Village to Soho, and the whole office was eager to see a head-to-head comparison. So we put together a quick website using Ruby, Sinatra, Twitter Bootstrap, and the d3.js library for visualization. This allowed us to better visualize pairwise comparisons of neighborhoods and to easily click through the whole dataset.
Here is a visualization of the differences between the East Village and Soho:
We see that Soho has a lot of activity at offices and clothing stores, whereas the East Village has a lot of activity at bars and pizza places.
We can now also compare neighborhoods algorithmically across different cities:
Most similar to NY’s East Village in San Francisco:
Most similar to SF’s Chinatown in NY:
Chinatown — obviously
Long Island City
Most similar to Seattle’s Capitol Hill in SF:
Most similar to NY’s Coney Island in Orlando:
Walt Disney World Resort
Sea World Theme Park
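A cross-city lookup like the ones above amounts to ranking every neighborhood in the target city by its similarity to the query neighborhood. The Python sketch below assumes cosine similarity as the ranking measure; the `most_similar` helper, the placeholder neighborhood names, and the three-category vectors are hypothetical and made up for illustration.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vector, candidates, top_n=2):
    """Return the names of the `top_n` candidates closest to the query.

    `candidates` maps neighborhood name -> category vector for the
    target city (a hypothetical structure for this sketch).
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


# Toy vectors over (bars, offices, theme parks), made up for illustration.
query = [0.7, 0.2, 0.1]
target_city = {
    "Neighborhood A": [0.6, 0.3, 0.1],
    "Neighborhood B": [0.1, 0.8, 0.1],
    "Neighborhood C": [0.2, 0.2, 0.6],
}
matches = most_similar(query, target_city)
```

Because the vectors are expressed relative to each city's own baseline, a neighborhood is matched on how it stands out within its city rather than on raw check-in volume.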
Inherent in Foursquare’s 1,500,000,000 check-ins is a staggering amount of information about the characteristics of cities. It is now possible to quantify and measure the ways people interact with neighborhoods at a higher resolution than ever before. Two engineers were able to complete this whole hackday project in just a day and a half because of the amazing data, infrastructure, and tools provided at Foursquare. There are many directions this project could go in; for example, we’re looking forward to incorporating user demographics and time-based information into the model. If you have some good ideas for what to try next, please leave them in the comments, or better yet, join us and try them yourself!