The Mathematics of Gamification

At Foursquare, we maintain a database of 60 million venues. And like the world it represents, our database is ever-changing, with users from all over the world submitting updates on everything from the hours of a restaurant to the address of a new barbershop. To maintain the accuracy of our venue database, these changes are voted upon by our loyal Superusers (SUs) who vigilantly maintain a watchful eye over our data for their city or neighborhood.

Like many existing crowd-sourced datasets (Quora, Stack Overflow, Amazon Reviews), we assign users points or votes based on their tenure, reputation, and the actions they take. Superusers like points and gamification. It rewards diligent, hard-working SUs (which are the majority) and punishes the few malicious “bad players.” But data scientists like probabilities and guarantees. We’re interested in making statements like, “we are 99% confident that each entry is correct.” How do we allocate points to users in a way that rewards them for behavior but allows us to make guarantees about the accuracy of our database?

At Foursquare, we have a simple, first-principles based method of resolving proposed venue attribute updates. We can gauge each Superuser’s voting accuracy based on their performance on honeypots (proposed updates with known answers which are deliberately inserted into the updates queue). Measuring performance and using these probabilities correctly is the key to how we assign points to a Superuser’s vote.

The Math

Let’s make this more concrete with some math. Let $H_0$ denote the true state of the world, either $1$ or $-1$, which we can interpret as a proposed update being true or false, respectively. We do not observe this but we know $H_0 = 1$ with a-priori probability $p_0$. User 1 votes $H_1$ (again, either $1$ or $-1$, representing “yay” or “nay”) with independent probability $p_1$ of agreeing with the truth $H_0$ and $(1-p_1)$ of disagreeing. Bayes’ Rule then gives us
\P(H_0 = 1 | H_1 = 1) & = \frac{\P(H_1 = 1 | H_0 = 1) \P(H_0 = 1)}{\P(H_1 = 1)} \\
& = \frac{p_0 p_1}{p_0 p_1 + (1-p_0)(1-p_1)} \\
& = \frac{\ell_0 \ell_1}{\ell_0 \ell_1 + 1}
\end{align*}where we have written the solution in terms of the likelihood ratio $\ell_k = \ell(p_k)$ given by
\[ \ell(p) = \frac{p}{1-p} \qquad \ell^{-1}(\cdot) = \frac{\cdot}{1 + \cdot}\,. \]Then we have that
\[ \P(H_0 = 1 | H_1 = 1) = \ell^{-1}(\ell(p_0) \ell(p_1))\,. \]In fact, it is easy to see that in the general case,
\[ \P(H_0 = 1 | H_1 = h_1) = \ell^{-1}(\ell(p_0) \ell(p_1)^{h_1})\,. \]Multiplication is hard so we will define the logit or log-likelihood function
\[ \logit: (0,1) \to (-\infty, \infty) \]given by
\[ \logit(p) = \log(\ell(p)) = \log\prn{\frac{p}{1-p}} \qquad \logit^{-1}(\cdot) = \frac{e^{\cdot}}{1 + e^{\cdot}}\,. \]Then we have
\[ \P(H_0 = 1| H_1 = h_1) = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1))\,. \]

Continuing, assume that after user 1 casts their vote, user 2 votes $H_2$ with an independent probability $p_2$ of being correct (i.e. agreeing with $H_0$). We can think of the posterior probability $\P(H_0 = 1| H_1 = h_1)$ as our new prior and inductively repeat the above Bayesian analysis to obtain
\P(H_0 = 1 | H_1 = h_1, H_2 = h_2) & = \logit^{-1}\prn{\logit\prn{\P(H_0 = 1 | H_1 = h_1)} + h_2 \logit(p_2)} \\
& = \logit^{-1}(\logit(p_0) + h_1 \logit(p_1) + h_2 \logit(p_2))\,.
\end{align*}In fact, if we have $n$ votes $H_1, \ldots, H_n$, then we have
\begin{align} & \P(H_0 = 1| H_1 = h_1, \ldots, H_n = h_n) \nonumber\\
& \qquad\qquad = \logit^{-1}\prn{ \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) } \,. \label{eq:main}

The Solution

The above equation suggests that we should assign $s_k$ points or votes to user $k$ based on \begin{equation} s_k = \logit(p_k)\,. \label{eq:points} \end{equation} We can add up all the “yay” votes and subtract all the “nay” votes to obtain a score for the update. This score can easily be interpreted as a probability that the update is correct. We can set a certainty threshold $p$ (e.g. $p = 99\%$) as a threshold for a desired accuracy of this edit. Then, we accept a proposed edit as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \ge \logit(p) \label{eq:upper} \end{equation} and reject it as soon as \begin{equation} \logit(p_0) + \sum_{k=1}^n h_k \logit(p_k) \le – \logit(p)\,. \label{eq:lower} \end{equation}

In other words, if we take $t = \logit(p)$ to the the points threshold and $s_0 = \logit(p_0)$ to be the points allocated to a new proposed edit, then \eqref{eq:upper} and \eqref{eq:lower} become
\[ s_0 + \sum_{k=1}^n h_k s_k \ge t \]and
\[ s_0 + \sum_{k=1}^n h_k s_k \le – t\,, \]which are exactly the equations for voting you would expect. But now, they’re derived from math!

The Benefits

  • Efficient, data-driven guarantees about database accuracy. By choosing the points based on a user’s accuracy, we can intelligently accrue certainty about a proposed update and stop the voting process as soon as the math guarantees the required certainty.
  • Still using points, just smart about calculating them. By relating a user’s accuracy and the certainty threshold needed to accept a proposed update to an additive point system \eqref{eq:points}, we can still give a user the points that they like. This also makes it easy to take a system of ad-hoc points and convert it over to a smarter system based on empirical evidence.
  • Scalable and easily extensible. The parameters are automatically trained and can adapt to changes in the behavior of the userbase. No more long meetings debating how many points to grant to a narrow use case.
    So far, we’ve taken a very user-centric view of $p_k$ (this is the accuracy of user $k$). But we can go well beyond that. For example, $p_k$ could be “the accuracy of user $k$’s vote given that they have been to the venue three times before and work nearby.” These clauses can be arbitrarily complicated and estimated from a (logistic) regression of the honeypot performance. The point is that these changes will be based on data and not subjective judgments of how many “points” a user or situation should get.

Some practical considerations:

  • In practice, we might want a different threshold for accepting \eqref{eq:upper} versus rejecting \eqref{eq:lower} a proposed edit.
  • For notational simplicity, we have assumed that a false positives and false negatives in user $k$’s voting accuracy have the same probability $p_k$. In general, this is not the case. We leave it to the reader to figure the math of the general case.
  • Users like integer points. We have to round $s_k$ to the nearest integer. Because we can multiply linear equations like \eqref{eq:upper} and \eqref{eq:lower} by a positive constant, we can set $s_k = [\alpha \cdot \logit(p_k)]$ where $[\cdot]$ is the rounding function and $\alpha$ is a large positive constant. A large $\alpha$ will prevent the loss of fidelity.
  • We’ve explained how to obtain $p_1, p_2, \ldots$ from honeypots but how do we obtain $p_0$, the accuracy of newly proposed updates. One way is to use the above to bootstrap those accuracies from voting: we can use this voting technique to infer the accuracy of proposals by looking at what fraction of proposed updates are accepted!
  • Bayesian Smoothing. We assume a relatively low-accuracy prior for the accuracy of individuals. This is a pessimistic assumption that keeps new, untested users from having too much influence. It also rewards users for lending their judgment and casting votes as long as those are more accurate than our pessimistic prior. Of course, we also increase the likelihood of showing new Super Users honeypots to give them a chance to prove themselves.

–Michael Li
Data Scientist