Be A/Bayesian: An exploration of Bayesian A/B testing


Introduction

After dabbling in the world of large language models in my last two blog posts (the first on attention in transformers; the second on training a full GPT), I needed some respite from the world’s current hype train and so decided to return to my statistical roots. In my day job working as a statistician on difficult applied problems in public policy and business strategy settings, I more often than not adopt a Bayesian approach to statistical inference. I do this for a variety of reasons:

  • More often than not, we have prior information (i.e., from previous similar projects or from stakeholder consultations) – why not build that information into the models?
  • In applied scenarios, \(p\)-values are quite unintuitive, often abused, and usually uninformative, especially in policy-making settings.
  • Confidence intervals themselves are quite unintuitive – “If I repeated the test many times and calculated a confidence interval each time, the true quantity of interest would fall within the interval in \(95\%\) of them.” That doesn’t seem overly useful compared to what a Bayesian can say: “the true value of the quantity of interest is between X and Y with \(95\%\) probability.”
  • You can update as new data is available.
  • I just love the idea of treating everything as a probability distribution.

Most of my work usually involves complex statistical modelling, like Bayesian hierarchical models or probabilistic forecasting. As such, I’ve been itching to strip things back to a domain that is surging in importance as organisations become increasingly data literate and turn to things such as experiments to inform their decision making: A/B testing.

A/B testing

At its core, A/B testing is a pretty simple concept: you run two variants of some product or initiative, and you compare the responses of those exposed to each on a metric of interest. An example could be two variants of a marketing campaign, or two different emails to university students promoting joining an association, or two versions of a new software feature for Instagram. The beauty of A/B testing is that the randomised assignment to either the A or B condition eliminates common problems associated with causal inference, such as confounding (provided the assignment is in fact random!). This means the statistical methods used to evaluate A/B tests are relatively simple yet powerful.

Those who have taken at least one statistics unit at university would likely be familiar with frequentist methods used to evaluate differences between two groups on a single continuous variable, such as \(t\)-tests. However, the choice of test very much depends on the distribution of the metric of interest. A \(t\)-test is appropriate for continuous outcome variables that are Gaussian distributed. But what if our metric were a rate – such as click-through rate on a website – which is bounded in the domain \([0,1]\)? What if our metric were the number of items purchased in a store – which is evidently a count and therefore discrete? Thankfully, there are methods for each response distribution, such as Fisher’s exact test for binomial distributions, the E-test for comparing means of Poisson distributions, and many more.
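As a concrete illustration of the frequentist route for a binomial metric, R’s built-in fisher.test can compare conversion counts between two groups (the counts here are made up purely for illustration):

```r
# Hypothetical conversion counts for two groups (illustrative only)
tab <- matrix(c(49, 1000 - 49,    # Group A: converted, did not convert
                69, 1000 - 69),   # Group B: converted, did not convert
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              outcome = c("converted", "not_converted")))

fisher.test(tab)  # tests whether the odds of conversion differ between groups
```

The output reports a \(p\)-value and an odds ratio with a confidence interval – exactly the kind of summary the Bayesian approach below will replace with full posterior distributions.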

Here we are going to focus on a common binomial outcome measured in marketing – conversion rate.

A/B testing as a Bayesian

Adopting a Bayesian approach to A/B testing is incredibly powerful. It enables you to answer questions such as ‘what is the probability that Product A is better than Product B?’ To do this, Bayesians require two things:

  1. Priors
  2. Data (i.e., ‘likelihood’)

Without diving too deep into the low-level explanation of Bayesian inference – there are much better resources out there – a prior represents our belief before observing any data. Our priors and the likelihood of the actual data are then combined1 in a product to form a posterior distribution – our updated beliefs after observing the data.
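For the binomial outcomes used in this post, that product actually has a closed form: a beta(\(a\), \(b\)) prior combined with \(y\) conversions in \(n\) trials yields a beta(\(a + y\), \(b + n - y\)) posterior. A minimal sketch in R (the numbers are illustrative, not from any real data):

```r
# Conjugate beta-binomial update: beta prior + binomial likelihood -> beta posterior
a <- 1; b <- 10     # illustrative prior shape parameters
y <- 49; n <- 1000  # illustrative observed conversions and trials

a_post <- a + y       # posterior shape1
b_post <- b + n - y   # posterior shape2

# The posterior mean sits between the prior mean and the observed rate
c(prior_mean = a / (a + b),
  observed_rate = y / n,
  posterior_mean = a_post / (a_post + b_post))
```

With \(n = 1000\) observations, the data dominate the prior and the posterior mean lands very close to the observed rate – a preview of the behaviour we will see from the full model.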

With that little introduction out of the way, let’s get stuck in!

Simulating data

First of all we are going to simulate some data to work with. Since we are going to use the binomially-distributed conversion rate as our metric, we can make use of R’s rbinom function. We are going to functionalise this process so we can run some experiments to see how various aspects of the data impact the sensitivity of inferences. We will also load the rstan package for the R interface to the Stan probabilistic programming language which enables Bayesian inference, and ggplot2 for drawing plots. Note that we are storing our results in a list because this is the format that rstan expects.

library(rstan)
library(ggplot2)
#' Simulate basic A/B test data
#' 
#' @param n_a \code{integer} denoting the sample size for Group A. Defaults to \code{1000}
#' @param n_b \code{integer} denoting the sample size for Group B. Defaults to \code{1000}
#' @param p_a \code{numeric} denoting the true conversion rate probability of Group A. Defaults to \code{0.05}
#' @param p_b \code{numeric} denoting the true conversion rate probability of Group B. Defaults to \code{0.07}
#' @param seed \code{integer} fix for reproducibility. Defaults to \code{123}
#' @return \code{list} containing the results
#' @author Trent Henderson
#' 

simulate <- function(n_a = 1000, n_b = 1000, p_a = 0.05, p_b = 0.07, seed = 123){
  set.seed(seed)
  y_a <- rbinom(1, n_a, p_a)
  y_b <- rbinom(1, n_b, p_b)
  
  sims <- list(
    y_a = y_a,
    n_a = n_a,
    y_b = y_b,
    n_b = n_b
    )
  
  return(sims)
}

sims <- simulate()
sims
## $y_a
## [1] 49
## 
## $n_a
## [1] 1000
## 
## $y_b
## [1] 69
## 
## $n_b
## [1] 1000

We can see that (due largely to our decent sample size of \(N = 1000\)), our final number of conversions for each group is approximately the true ratio we specified: for Group A, \(\frac{49}{1000} = 0.049 \approx 0.05\) and for Group B, \(\frac{69}{1000} = 0.069 \approx 0.07\). If we instead used a much smaller sample size of \(N = 100\), we would get:

simulate(n_a = 100, n_b = 100)
## $y_a
## [1] 4
## 
## $n_a
## [1] 100
## 
## $y_b
## [1] 9
## 
## $n_b
## [1] 100

where the proportion of conversions for Group A is now \(0.04\) and for Group B is \(0.09\). Of course, these exact numbers would vary with the random seed we choose, but the impact of sample size on this convergence is well-studied – it’s just nice to see it in action again.
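To see this sampling variability directly, we can repeat the draw across a grid of sample sizes at the true rate \(p = 0.05\) (a quick illustrative check, separate from the main analysis):

```r
# How the observed conversion rate converges towards the true p = 0.05 as n grows
set.seed(123)
sizes <- c(50, 100, 1000, 10000)
observed <- sapply(sizes, function(n) rbinom(1, n, 0.05) / n)
data.frame(n = sizes, observed_rate = observed)
```

The smaller samples bounce around the true rate far more than the larger ones do.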

With our data in hand, we can now define a simple Bayesian model in Stan to understand conversion rates. A Stan model is written in the Stan language (which is compiled to C++ under the hood) and comprises the following basic blocks:

  • data – the inputs Stan receives. Note that these names map to the data we passed into the list in our simulation function.
  • parameters – parameters to be estimated by Stan. In our case, these are just the conversion rate probabilities for each group.
  • model – where the magic happens! This is where we specify the probability distributions for our priors and likelihood.
  • generated quantities – literally anything you want Stan to estimate using the model! This can be forecasts, simulated data points etc. In our case, we are just going to calculate the difference in conversion rate probabilities between Group B and Group A for a nice posterior distribution.

NOTE: There are additional Stan blocks available for more complex settings, such as transformed data and transformed parameters. We do not need these here, so I have ignored them. Please see an older blog post of mine for an example usage of these blocks.

conversion_bayes <- "
data {
  int<lower=0> y_a;
  int<lower=0> n_a;
  int<lower=0> y_b;
  int<lower=0> n_b;
}

parameters {
  real<lower=0,upper=1> p_a;
  real<lower=0,upper=1> p_b;
}

model {

  // Priors
  
  p_a ~ beta(1, 10);
  p_b ~ beta(1, 10);

  // Likelihood
  
  y_a ~ binomial(n_a, p_a);
  y_b ~ binomial(n_b, p_b);
}

generated quantities {
  real diff = p_b - p_a; // Difference in conversion rate probabilities
}
"

A detailed exploration of priors

Let’s talk about priors for a bit, as it’s a key reason why we have gone Bayesian. In the real world, we would make use of all the information available to us – stakeholder consultations, previous research, public reports, statistical intuition, etc. – to make plausible expectations about relationships before observing the data. For our contrived simulation this is awkwardly artificial, but we’ll make do.

Since a conversion rate is a probability, we are looking to specify a prior distribution that is bounded in the domain \([0,1]\)2. More specifically, we are seeking to specify a prior probability distribution over the probability parameter of the binomial distribution. A beta distribution is designed exactly for this purpose. Let’s explore beta distributions in more detail to build the intuition for what I did in the Stan code above.

The density of a beta distribution with parameters shape1 (\(a\)) and shape2 (\(b\)) – such as that used by R’s rbeta function – is written as:

\[ \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} x^{a-1} (1-x)^{b-1} \]
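As a quick check, we can confirm this expression agrees with R’s built-in dbeta at a few arbitrary points (the shapes chosen here are arbitrary):

```r
# Verify the beta density formula against R's dbeta
a <- 2; b <- 5
x <- c(0.1, 0.3, 0.7)
manual <- gamma(a + b) / (gamma(a) * gamma(b)) * x^(a - 1) * (1 - x)^(b - 1)
all.equal(manual, dbeta(x, shape1 = a, shape2 = b))  # TRUE
```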

That’s pretty hairy, so let’s plot some data to understand what’s going on. We have \(N = 1000\) samples per group, so let’s stick with that and start with just \(a = 1\) and \(b = 1\) and see what the probability distribution looks like:

ggplot(data.frame(x = rbeta(1000, 1, 1)), aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.8) + 
  theme_bw()

Looks sort of uniform across the \([0,1]\) domain. Let’s halve \(a\) to be \(a = 0.5\):

ggplot(data.frame(x = rbeta(1000, 0.5, 1)), aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.8) + 
  theme_bw()

A complete shift! Now we can see most of the probability density is concentrated among smaller values, with a long tail out to \(1\). Let’s revert \(a = 1\) and make the same halving change but for \(b\):

ggplot(data.frame(x = rbeta(1000, 1, 0.5)), aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.8) + 
  theme_bw()

It goes the other way! What about if both \(a > 1\) and \(b > 1\)?

ggplot(data.frame(x = rbeta(1000, 5, 5)), aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.8) + 
  theme_bw()

Hopefully this is starting to build some intuition for what the shape parameters are doing. Now, in our example, we know (through simulation) the true values of p_a and p_b. But let’s assume we do not (as is the case in the real world), and instead use information about our customer base to build a prior. As stated earlier, we expect each product to drive some conversion rate, but we do not know what these rates are. Say we know from previous marketing initiatives that conversion rates at our company are, on average, about \(5\%\), irrespective of the initiative. Intuitively, we need a prior distribution that concentrates most of its probability density around \(5\%\), but with sufficient variability to capture our uncertainty about these new products. Such a prior might look like:

ggplot(data.frame(x = rbeta(1000, 1, 10)), aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.8) + 
  theme_bw()

Given we expect B to outperform A (hence computing the difference as B minus A in the model), but likely not by a massive margin given historical results, we can use the same prior for each group since it includes sufficient uncertainty. You have probably noted by now that this is exactly what I did in the model code above.
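If you want to sanity-check the beta(1, 10) choice numerically, its mean and quantiles are easy to compute (this is just a property of the distribution, not of our model):

```r
# Where does the beta(1, 10) prior place its mass?
prior_mean <- 1 / (1 + 10)                        # analytic mean a / (a + b)
prior_quantiles <- qbeta(c(0.05, 0.5, 0.95), 1, 10)

round(prior_mean, 3)       # ~0.091
round(prior_quantiles, 3)  # roughly 0.005, 0.067, 0.259
```

The prior’s median sits near \(7\%\), with \(90\%\) of its mass between roughly \(0.5\%\) and \(26\%\) – loose, but centred in the right region for conversion rates around our historical \(5\%\).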

Fitting the model to data

With a proper understanding of our prior beliefs, we can now fit our model to the data! The code to do this is pretty trivial – we just need to take care to use enough iterations per chain in order to ensure our tail densities are okay and that we have enough draws to make robust statistical inferences (basically, more is better, but there is a computational cost!).

conversion_model <- stan_model(model_code = conversion_bayes)

fit <- sampling(
  conversion_model,
  data = sims,
  iter = 5000,
  chains = 4,
  cores = 4,
  seed = 123
)

It’s not shown here, but note that for such a simple model, the majority of the computational overhead is in compiling the Stan code initially (i.e., the stan_model() call, which only happens once). The actual model ran in one second!

What can we do with this fit object now? First, we need to extract the posterior distributions for each parameter and generated quantity:

posterior <- rstan::extract(fit)

We can now do a multitude of things! For example, what is the probability that Product B leads to a higher conversion rate than Product A?

mean(posterior$p_b > posterior$p_a)
## [1] 0.9717

\(97.2\%\)! Yikes! While a point estimate like this is exciting, the whole point (ha ha) of being a Bayesian is to utilise the full probability distribution. Here are the posterior conversion rate distributions for each product:

ggplot(data.frame(x = append(posterior$p_a, posterior$p_b), 
                  category = append(rep("A", times = length(posterior$p_a)),
                                    rep("B", times = length(posterior$p_b)))), 
       aes(x = x, fill = category)) + 
  geom_histogram(aes(y = after_stat(count / sum(count))), alpha = 0.5) + 
  geom_vline(aes(xintercept = mean(posterior$p_a)), linetype = "dashed", linewidth = 0.8, colour = "#F8766D") +
  geom_vline(aes(xintercept = mean(posterior$p_b)), linetype = "dashed", linewidth = 0.8, colour = "#00BFC4") +
  labs(x = "Posterior conversion rate",
       y = "Proportion of samples",
       fill = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

We can see that the peaks of each distribution are located close to the true values we simulated the data from. This is a nice sense check of the approach. Now, instead of plotting each distribution individually, we can also plot the posterior differences directly:

ggplot(data.frame(x = posterior$diff), aes(x = x)) + 
  geom_histogram(aes(y = after_stat(count / sum(count))), fill = "steelblue2", alpha = 0.8) + 
  geom_vline(aes(xintercept = mean(posterior$diff)), linetype = "dashed", linewidth = 0.8, colour = "steelblue2") +
  labs(x = "Posterior probability of B - A",
       y = "Proportion of samples") +
  theme_bw()

This is helpful, but we can see some cases where A does in fact beat B. Let’s visually highlight the two to make it clear:

diffs <- data.frame(x = posterior$diff)
diffs$category <- ifelse(diffs$x <= 0, "A > B", "B > A")

ggplot(diffs, aes(x = x, fill = category)) + 
  geom_histogram(aes(y = after_stat(count / sum(count))), alpha = 0.6) + 
  geom_vline(aes(xintercept = 0), linetype = "dashed", linewidth = 0.8) +
  labs(x = "Posterior probability of B - A",
       y = "Proportion of samples",
       fill = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

We are compiling compelling evidence for the efficacy of B over A!

As a final example, let’s compute a credible interval – that is, the range within which the true difference in conversion rate between Product B and Product A lies with a given probability. We’ll use a \(90\%\) credible interval here for no particular reason other than it’s boring to see \(95\%\) used everywhere:

quantile(posterior$diff, 0.05)
##          5% 
## 0.002623643
quantile(posterior$diff, 0.95)
##        95% 
## 0.03670149

This means we can make the following statement: the true difference in conversion rate between B and A is between \(0.003\) and \(0.037\) with \(90\%\) probability. We should therefore adopt Product B moving forward. We are now ready to meet with our internal stakeholders!
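As one last sanity check on the Stan fit: because the beta prior used here is conjugate to the binomial likelihood, the posteriors have closed forms, so we can draw from them directly and compare against the MCMC output (a sketch using the simulated counts from earlier):

```r
# Conjugate cross-check: the posteriors are beta(a + y, b + n - y) in closed form
set.seed(123)
draws <- 100000
p_a_draws <- rbeta(draws, 1 + 49, 10 + 1000 - 49)  # closed-form posterior for Group A
p_b_draws <- rbeta(draws, 1 + 69, 10 + 1000 - 69)  # closed-form posterior for Group B

mean(p_b_draws > p_a_draws)  # should land very close to Stan's ~0.97
```

Agreement between this closed-form answer and the sampler’s output is a reassuring sign that the MCMC behaved as expected.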

Parting thoughts

Thanks for reading this short little tour of a very useful method for real-world problem solving. I might do a follow-up piece in the future on some other Bayesian A/B testing ideas, so please check back for that.


  1. This is a very drastic oversimplification!↩︎

  2. This is known as a conjugate prior which is useful for both interpretability and computational efficiency, but we need not use one if it doesn’t accurately reflect our beliefs.↩︎