# Bayesian A/B Tests

Here at RichRelevance we regularly run live tests to ensure that our algorithms are providing top-notch performance.  Our RichRecs engine, for example, displays personalized product recommendations to consumers, and the \$\$A/B\$\$ tests we run pit our recommendations against recommendations generated by a competing algorithm, or no recommendations at all.  We test for metrics like click-through-rate, average order value, and revenue per session.  Historically, we have used null hypothesis tests to analyze the results of our tests, but are now looking ahead to the next generation of statistical models.  Frequentist is out, and Bayesian is in!

Why are null hypothesis tests under fire?  There are many reasons [e.g. here or here], and a crucial one is that null hypothesis tests and p-values are hard to understand and hard to explain.  There are arbitrary thresholds (0.05?) and the results are binary – you can either reject the null hypothesis or fail to reject the null hypothesis.  And is that what you really care about? Which of these two statements is more appealing:

(1) “We rejected the null hypothesis that \$\$A = B\$\$ with a p-value of 0.043.”

(2) “There is an 85% chance that \$\$A\$\$ has a 5% lift over \$\$B\$\$.”

Bayesian modeling can answer questions like (2) directly.

What’s Bayesian, anyway?  Here’s a short but thorough summary [source]:

The Bayesian approach is to write down exactly the probability we want to infer, in terms only of the data we know, and directly solve the resulting equation […] One distinctive feature of a Bayesian approach is that if we need to invoke uncertain parameters in the problem, we do not attempt to make point estimates of these parameters; instead, we deal with uncertainty more rigorously, by integrating over all possible values a parameter might assume.

Let’s think this through with an example.  Assume your parameter-of-interest is click-through rate (CTR), and your \$\$A/B\$\$ test is pitting two different product recommendation engines against one another.  With null hypothesis testing, you assume that there exist true-but-unknown click-through rates for \$\$A\$\$ and \$\$B,\$\$ which we will write as \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B,\$\$ and the goal is to figure out if they are different or not.

With Bayesian statistics we will instead model \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B\$\$ as random variables, and specify their entire distributions (I’ll go through this example in more detail in the next section).  \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B\$\$ are no longer two numbers, but are now two distributions.

Here’s a quick dictionary of Bayesian terms:

• prior – a distribution that encodes your prior belief about the parameter-of-interest
• likelihood – a function that encodes how likely your data is given a range of possible parameters
• posterior – a distribution of the parameter-of-interest given your data, combining the prior and likelihood

So forget everything you know about statistical testing for now.  Let’s start from scratch and answer our customer’s most important question directly: what is the probability that \$\$\text{CTR}_A\$\$ is larger than \$\$\text{CTR}_B\$\$ given the data from the experiment (i.e. a sequence of 0s and 1s in the case of click-through-rate)?

To compute this probability, we’ll first need to find the joint distribution (a.k.a. the posterior):

\$\$!P(\text{CTR}_A,\text{CTR}_B|\text{data}),\$\$

and then integrate over the region of interest.  What does that mean?  Well, \$\$P(\text{CTR}_A,\text{CTR}_B|\text{data})\$\$ is a two-dimensional function of \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B.\$\$  So to find \$\$P(\text{CTR}_A>\text{CTR}_B|\text{data})\$\$ we have to add up all the probabilities in the region where \$\$\text{CTR}_A>\text{CTR}_B\$\$:

\$\$!P(\text{CTR}_A > \text{CTR}_B|\text{data}) = \iint\limits_{\text{CTR}_A > \text{CTR}_B} P(\text{CTR}_A,\text{CTR}_B|\text{data}) \, d\text{CTR}_A \, d\text{CTR}_B.\$\$

To actually calculate this integral will require a few insights.  The first is that for many standard \$\$A/B\$\$ tests, \$\$A\$\$ and \$\$B\$\$ are independent because they are observed by non-overlapping populations.  Keeping this in mind, we have:

\$\$!P(\text{CTR}_A,\text{CTR}_B|\text{data}) = P(\text{CTR}_A|\text{data}) P(\text{CTR}_B|\text{data}).\$\$

This means we can do our computations separately for \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B\$\$ and then combine them at the very end to find the probability that \$\$\text{CTR}_A > \text{CTR}_B.\$\$  Then, applying Bayes’ rule to both \$\$P(\text{CTR}_A|\text{data})\$\$ and \$\$P(\text{CTR}_B|\text{data}),\$\$ we get:

\$\$!P(\text{CTR}_A,\text{CTR}_B|\text{data}) = \frac{P(\text{data}|\text{CTR}_A)P(\text{CTR}_A) P(\text{data}|\text{CTR}_B) P(\text{CTR}_B)}{P(\text{data})P(\text{data})}.\$\$

The next step is to define the models \$\$P(\text{data}|\cdot)\$\$ and \$\$P(\cdot).\$\$  (We don’t need a model for \$\$P(\text{data})\$\$ because, in practice, we’ll never have to use it to compute the probabilities of interest.)  The models are different for every type of test, and the simplest is…

### Binary A/B Tests

If your data is a sequence of 0s and 1s, a binomial coin-flip model is appropriate.  In this case we can summarize each side of the test by the parameters \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B,\$\$ where \$\$\text{CTR}_A\$\$ is the probability of a 1 on the \$\$A\$\$ side (and likewise for \$\$B\$\$).

We’ll need some more notation.  Let \$\$\text{clicks}_A\$\$ and \$\$\text{views}_A\$\$ be the number of clicks and the total number of views, respectively, on the \$\$A\$\$ side.  The likelihood is then:

\$\$!\begin{align*}P(\text{data}|\text{CTR}_A) &= P(\text{views}_A, \text{clicks}_A | \text{CTR}_A )\\&= {\text{views}_A \choose \text{clicks}_A} \text{CTR}_A^{\text{clicks}_A} \left(1-\text{CTR}_A\right)^{\text{views}_A-\text{clicks}_A},\end{align*}\$\$

with a similar-looking equation for the \$\$B\$\$ side.  Choosing the prior \$\$P(\text{CTR}_A)\$\$ is a bit of a black art, but let’s just use the conjugate Beta distribution for mathematical and computational convenience (see here and here for more about conjugate priors).  Also, for the sake of fairness, we will use the same prior for \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B\$\$ (unless there is a good reason to think otherwise):

\$\$!\begin{align*}P(\text{CTR}_A) &= \text{Beta}(\text{CTR}_A;\alpha,\beta)\\&= \frac{1}{B(\alpha,\beta)}\, \text{CTR}_A^{\alpha-1}(1-\text{CTR}_A)^{\beta-1},\end{align*}\$\$

where \$\$B\$\$ is the beta function (confusingly, not the same as a Beta distribution), and \$\$\alpha\$\$ and \$\$\beta\$\$ can be set to reflect your prior belief on what \$\$\text{CTR}\$\$ should be.  Note that, viewed as a function of \$\$\text{CTR}_A,\$\$ the prior \$\$P(\text{CTR}_A)\$\$ has the same form as the likelihood \$\$P(\text{views}_A, \text{clicks}_A | \text{CTR}_A )\$\$: a power of \$\$\text{CTR}_A\$\$ times a power of \$\$1-\text{CTR}_A.\$\$  That is what makes the Beta prior conjugate to the binomial likelihood.  Their product is again a Beta distribution, so we can write the posterior probability directly as:

\$\$!\begin{align*}P(\text{CTR}_A |\text{views}_A, \text{clicks}_A) &= \frac{P(\text{views}_A, \text{clicks}_A | \text{CTR}_A ) P(\text{CTR}_A)}{P(\text{views}_A,\text{clicks}_A)}\\&= {\text{views}_A \choose \text{clicks}_A} \frac{\text{CTR}_A^{\text{clicks}_A + \alpha -1} \left(1-\text{CTR}_A\right)^{\text{views}_A-\text{clicks}_A + \beta - 1}}{P(\text{views}_A,\text{clicks}_A)B(\alpha,\beta)}\\&\propto \text{Beta}(\text{CTR}_A; \text{clicks}_A + \alpha, \text{views}_A - \text{clicks}_A + \beta).\end{align*}\$\$
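The conjugacy result can be checked numerically.  The sketch below is a rough illustration rather than part of the analysis itself: it multiplies the Beta prior by the binomial kernel on a grid, normalizes the product, and compares it with the closed-form Beta posterior.  The small counts (3 clicks in 20 views), the grid size, and the helper name `beta_pdf` are assumptions made for the example; the prior parameters are the ones used in the sampling code later in this post.

```python
from math import exp, lgamma, log


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1 (computed via logs)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))


clicks, views = 3, 20      # made-up data, small enough to avoid underflow
alpha, beta = 1.1, 14.2    # prior parameters from the example in this post

steps = 2000
h = 1.0 / steps
grid = [(i + 0.5) * h for i in range(steps)]  # midpoints of (0, 1)

# Prior times binomial kernel. The binomial coefficient is constant in the
# CTR, so it cancels when we normalize and can be left out.
unnormalized = [
    beta_pdf(x, alpha, beta) * x**clicks * (1 - x) ** (views - clicks)
    for x in grid
]
evidence = sum(unnormalized) * h  # plays the role of P(data), up to a constant
grid_posterior = [u / evidence for u in unnormalized]

# Closed-form posterior: Beta(clicks + alpha, views - clicks + beta).
closed_form = [beta_pdf(x, clicks + alpha, views - clicks + beta) for x in grid]

max_gap = max(abs(g - c) for g, c in zip(grid_posterior, closed_form))
print(f"largest pointwise gap between the two densities: {max_gap:.2e}")
```

The normalizing constant is recovered by summing over the grid, which is one way to see why no explicit model for \$\$P(\text{data})\$\$ is needed.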

(In practice it doesn’t really matter what prior we choose – we have so much experimental data that the likelihood will overwhelm the prior easily.  But we chose the Beta prior because it simplifies the math and computations.)
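To make the point about the prior concrete, here is a minimal sketch comparing posterior means under two different priors.  The large counts are the example data used later in this post; the small counts (4 clicks in 50 views), the flat Beta(1, 1) alternative, and the function name `posterior_mean` are assumptions added for contrast.

```python
def posterior_mean(clicks, views, alpha, beta):
    """Mean of the Beta(clicks + alpha, views - clicks + beta) posterior."""
    return (clicks + alpha) / (views + alpha + beta)


priors = {"flat": (1.0, 1.0), "example": (1.1, 14.2)}

# Large experiment (the example data in this post) vs. a tiny hypothetical one.
large = {name: posterior_mean(450, 56000, a, b) for name, (a, b) in priors.items()}
small = {name: posterior_mean(4, 50, a, b) for name, (a, b) in priors.items()}

large_gap = abs(large["flat"] - large["example"])
small_gap = abs(small["flat"] - small["example"])

print(f"56000 views: gap between posterior means = {large_gap:.2e}")
print(f"   50 views: gap between posterior means = {small_gap:.2e}")
```

With tens of thousands of views the two priors give nearly identical posterior means, while with only 50 views the choice of prior visibly shifts the estimate.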

Now we have two Beta distributions the product of which is proportional to our posterior – what’s next?  We can numerically compute the integral we wrote down earlier!  In particular, let’s find

\$\$!P(\text{CTR}_A > \text{CTR}_B | \text{data} = \{ \text{views}_A, \text{clicks}_A, \text{views}_B, \text{clicks}_B\}).\$\$

To do so, just draw independent samples of \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B\$\$ (Monte Carlo style) from \$\$!\text{Beta}(\text{CTR}_A; \text{clicks}_A + \alpha, \text{views}_A - \text{clicks}_A + \beta)\$\$ and \$\$!\text{Beta}(\text{CTR}_B; \text{clicks}_B + \alpha, \text{views}_B - \text{clicks}_B + \beta)\$\$ as follows (in Python):

```python
import numpy as np
from numpy.random import beta as beta_dist

N_samp = 10000  # number of samples to draw
clicks_A = 450  # insert your own data here
views_A = 56000
clicks_B = 345  # ditto
views_B = 49000
alpha = 1.1  # just for the example - set your own!
beta = 14.2
# Draw from each side's posterior: Beta(clicks + alpha, views - clicks + beta)
A_samples = beta_dist(clicks_A + alpha, views_A - clicks_A + beta, N_samp)
B_samples = beta_dist(clicks_B + alpha, views_B - clicks_B + beta, N_samp)
```

Now you can compute the posterior probability that \$\$\text{CTR}_A > \text{CTR}_B\$\$ given the data simply as:

`np.mean(A_samples > B_samples)`

Or maybe you’re interested in computing the probability that the lift of \$\$A\$\$ relative to \$\$B\$\$ is at least 3%.  Easy enough:

`np.mean( 100.*(A_samples - B_samples)/B_samples > 3 )`
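As a sanity check on the Monte Carlo estimate of the probability that A beats B, the double integral from earlier can also be evaluated deterministically.  Because the two posteriors are independent, the inner integral over the B side collapses to a cumulative probability, leaving a one-dimensional integral of (density of A at x) times (probability that B is below x).  The sketch below does this with the standard library only; the helper names, the integration range of 0 to 0.03, and the step count are assumptions chosen to suit the example data above.

```python
from math import exp, lgamma, log


def beta_log_pdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x)


def prob_a_greater_b(a_A, b_A, a_B, b_B, lo=0.0, hi=0.03, steps=6000):
    """Midpoint-rule estimate of P(CTR_A > CTR_B) for independent Betas.

    Assumes essentially all posterior mass of both sides lies in (lo, hi).
    """
    h = (hi - lo) / steps
    total = 0.0   # running value of the outer integral
    cdf_B = 0.0   # posterior mass of B in the cells already passed
    for i in range(steps):
        x = lo + (i + 0.5) * h
        f_A = exp(beta_log_pdf(x, a_A, b_A))
        f_B = exp(beta_log_pdf(x, a_B, b_B))
        # P(CTR_B < x) at the midpoint: full cells below plus half of this one.
        total += f_A * (cdf_B + 0.5 * f_B * h) * h
        cdf_B += f_B * h
    return total


# Posterior parameters for the example data and prior used above.
alpha, beta = 1.1, 14.2
p_a_beats_b = prob_a_greater_b(
    450 + alpha, 56000 - 450 + beta,
    345 + alpha, 49000 - 345 + beta,
)
print(f"P(CTR_A > CTR_B | data) is approximately {p_a_beats_b:.4f}")
```

The result should agree with `np.mean(A_samples > B_samples)` up to Monte Carlo noise.  If it does not, the integration range probably does not cover the posteriors for your data.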

Pretty neat, eh?  Stay tuned for the next blog post where I will cover Bayesian A/B tests for Log-normal data!

PS: How should you set your \$\$\alpha\$\$ and \$\$\beta\$\$ in the Beta prior?  You can set them both to 1, which is like throwing your hands up and saying “all values are equally likely!”  Alternatively, you can set \$\$\alpha\$\$ and \$\$\beta\$\$ such that the mean or mode of the Beta prior is roughly where you expect \$\$\text{CTR}_A\$\$ and \$\$\text{CTR}_B\$\$ to be.
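One possible way to turn “roughly where you expect the CTR to be” into numbers is to pick a prior mean and a prior weight expressed as a number of imaginary views.  This parameterization, and the names `beta_prior_from_mean` and `pseudo_views`, are assumptions for illustration; the formulas for the mean and mode of a Beta distribution are standard.

```python
def beta_prior_from_mean(mean, pseudo_views):
    """Return (alpha, beta) with the given prior mean and alpha + beta = pseudo_views."""
    if not 0.0 < mean < 1.0 or pseudo_views <= 0:
        raise ValueError("need 0 < mean < 1 and pseudo_views > 0")
    return mean * pseudo_views, (1.0 - mean) * pseudo_views


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); only defined this way for alpha, beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("mode formula requires alpha > 1 and beta > 1")
    return (alpha - 1) / (alpha + beta - 2)


# Hypothetical belief: CTR around 1%, held about as firmly as 200 views of data.
alpha, beta = beta_prior_from_mean(0.01, 200)
print(alpha, beta, beta_mean(alpha, beta), beta_mode(alpha, beta))
```

A larger `pseudo_views` makes the prior more stubborn, while a value that is small relative to the real number of views lets the data dominate.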

Reference: Bayesian Data Analysis, Chapter 2
