A/B testing with the chi-squared test¶

A/B testing is used to "test two variants of the same variable". Real-world examples include:

  • Do we get more sign-ups when the sign-up button is red or blue?
  • Should we send our users one notification a day or three to get them to open the app?
  • Does adding customer testimonials increase our click-through rate?

There should be one thing that we change, and we should have some way of measuring the outcome (e.g. sign-up buttons clicked, number of times the app was opened, click-through rate).

Let's say you've set your A/B test up and you've let it run for a while. You now have some data (e.g. the number of sign-ups) both for your current value of the variable (e.g. red sign-up button), which we'll call the "control", and the other value of the variable (e.g. blue sign-up button), which we'll call the "variant".

This is your data:

In [1]:
import pandas as pd

df = pd.DataFrame(
    [
        {"design": "control", "sign_ups": 1513, "page_views": 15646},
        {"design": "variant", "sign_ups": 1853, "page_views": 15130},
    ]
)

df
Out[1]:
    design  sign_ups  page_views
0  control      1513       15646
1  variant      1853       15130

Your manager asks you whether you should now change the sign-up button to blue. The variant has more sign-ups from fewer page views, so it should be an easy yes, right?

We can check the sign-up rate:

In [2]:
df["signup_rate"] = df["sign_ups"] / df["page_views"]

df
Out[2]:
    design  sign_ups  page_views  signup_rate
0  control      1513       15646     0.096702
1  variant      1853       15130     0.122472

The sign-up rate for the blue button is definitely higher, but is it higher by enough? What if the difference is all due to random chance, and the people shown the variant would've signed up at the same rate no matter the color of the button?

This is where the chi-squared test comes in! (Note: there are a few different chi-squared tests; the one we're using is Pearson's chi-squared test.) The chi-squared test is used to evaluate how likely it is that an observed difference between outcomes arose by chance.

First, we need to restructure our data. Instead of page views, we need counts of sign-ups and non-sign-ups. We can then drop the remaining columns (remembering that row index zero is the control and row index one is the variant).

In [3]:
df["not_sign_ups"] = df["page_views"] - df["sign_ups"]
df = df[["sign_ups", "not_sign_ups"]]

df
Out[3]:
   sign_ups  not_sign_ups
0      1513         14133
1      1853         13277

Next, we need to decide on a significance value. This is the threshold for the probability that the difference between the results arose by chance rather than from our button change; if the probability we calculate falls below it, our findings are statistically significant. If you want to be more sure that your button change actually increased sign-ups, then you want a lower significance value.

You might also hear the terms null hypothesis and alternative hypothesis. The null hypothesis is the scenario where the button change made no significant difference to the number of sign-ups (whereas the alternative hypothesis is the case where the button change did cause a significant difference). If the calculated probability that our results arose by chance is lower than our significance value, then we reject the null hypothesis (and thus accept the alternative hypothesis).
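In code, that decision rule looks something like this (a minimal sketch; the p_value here is a hypothetical stand-in for the probability we'll actually calculate later):

significance_value = 0.05
p_value = 0.03  # hypothetical placeholder for the probability our results arose by chance

if p_value < significance_value:
    # Unlikely to be chance: reject the null hypothesis.
    print("Statistically significant: the button change made a difference.")
else:
    # Plausibly chance: we cannot reject the null hypothesis.
    print("Not statistically significant: the difference may be due to chance.")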

Generally, if the switch to the new version costs more resources (time, money, effort, etc.) then you want to be more sure it's not due to chance, and thus have a lower significance value.

A significance value that is commonly used is 0.05, i.e. the probability that the difference is from random chance is 5% (or one in twenty).

Why do we pick the significance value now? Why can't we just do all the calculations, get a probability that our change is due to chance, and then decide whether it's good enough? Because choosing the threshold after seeing the results is a form of p-hacking (also known as data dredging): presenting outcomes as statistically significant even when they are not. Ideally, we should declare our predicted outcomes as early as possible (probably before we even start collecting data).

With the significance value, we can find a chi-squared value we need to beat in order to declare our results statistically significant. First, we need to calculate the degrees of freedom, which, for the chi-squared test on a contingency table, is (number of rows − 1) × (number of columns − 1):

In [4]:
n_rows, n_cols = df.shape

degrees_of_freedom = (n_rows - 1) * (n_cols - 1)

degrees_of_freedom
Out[4]:
1

scipy has a handy function for calculating our target chi-squared value. (Note: we pass 1 - significance_value, i.e. the cumulative probability below our target value.) We can also get the value from a table, such as this one.

In [5]:
significance_value = 0.05
In [6]:
import scipy.stats

scipy.stats.chi2.ppf(1 - significance_value, df=degrees_of_freedom)
Out[6]:
3.841458820694124

So, for our change to be statistically significant, the chi-squared value we calculate from our data (also known as the test statistic) should be greater than this critical value of 3.841.
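As a sanity check, we can go the other way: the probability of a chi-squared value above the critical value should be exactly our significance value (a quick sketch using scipy's survival function, i.e. 1 - CDF):

# The survival function at the critical value should give back our
# significance value of 0.05.
scipy.stats.chi2.sf(3.841458820694124, df=degrees_of_freedom)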

For calculating the chi-squared value of our data, we use the formula:

$$\chi^2 = \sum^n_{i=1}\frac{(O_i-E_i)^2}{E_i}$$

$O_i$ is the number of observations of type $i$. In this case, we have four types of observations: the number of sign-ups with the control variable (red button), the number of non-sign-ups with the control variable, the number of sign-ups with the variant variable (blue button), and the number of non-sign-ups with the variant variable.

$E_i$ is the expected value of each type $i$, asserted by the null hypothesis.

To get the expected value for each cell, we multiply the sum of that cell's row by the sum of that cell's column, then divide by the total number of observations across all cells.

For example, the expected value for sign-ups using the control variant (top left cell in our DataFrame) is:

$$\frac{(1513 + 14133) \times (1513 + 1853)}{(1513 + 14133 + 1853 + 13277)} = 1711.217\dots$$

How does this actually calculate an expected value for the number of sign-ups for the control variant? It combines two parts. The first part is the percentage of people who signed up across both variants:

$$\frac{(1513 + 1853)}{(1513 + 14133 + 1853 + 13277)} = 0.109\dots$$

We can think of this as the "expected sign-up rate". The second part is multiplying the expected sign-up rate by the number of people who visited the control variant, which is $(1513 + 14133) = 15646$, to get our expected sign-ups for the control variant:

$$0.109\dots \times 15646 = 1711.217\dots$$

Similarly, for, e.g., the bottom-right cell, we calculate the percentage of non-sign-ups, $(14133 + 13277) / (1513 + 14133 + 1853 + 13277)$, and then multiply it by the number of visitors to the variant, $(1853 + 13277)$.

We can implement it in our code like so:

In [7]:
import numpy as np

expected_values = np.zeros((n_rows, n_cols))
denominator = np.sum(df.values)  # total number of observations across all cells

for i in range(n_rows):
    for j in range(n_cols):
        row_total = df.iloc[i, :].sum()
        col_total = df.iloc[:, j].sum()
        numerator = row_total * col_total
        expected_value = numerator / denominator
        expected_values[i, j] = expected_value

expected_values
Out[7]:
array([[ 1711.21770211, 13934.78229789],
       [ 1654.78229789, 13475.21770211]])
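As an aside, the double loop can be replaced with a vectorized equivalent (a sketch of the same computation using np.outer: the outer product of the row totals and column totals, divided by the grand total):

# Outer product of row totals and column totals, divided by the total
# number of observations, gives the same expected values in one step.
row_totals = df.sum(axis=1).to_numpy()
col_totals = df.sum(axis=0).to_numpy()
np.outer(row_totals, col_totals) / df.to_numpy().sum()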

Now that we have the expected values, we can use the rest of the formula to calculate the individual terms, which we sum to get the chi-squared value:

In [8]:
values = []

for i in range(n_rows):
    for j in range(n_cols):
        observed_value = df.iloc[i, j]
        expected_value = expected_values[i, j]
        value = np.square(observed_value - expected_value) / expected_value
        values.append(value)

chi_squared_value = np.sum(values)

chi_squared_value
Out[8]:
52.43919221285117
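
Written out in full with our observed and expected values, the sum is:

$$\chi^2 = \frac{(1513 - 1711.22)^2}{1711.22} + \frac{(14133 - 13934.78)^2}{13934.78} + \frac{(1853 - 1654.78)^2}{1654.78} + \frac{(13277 - 13475.22)^2}{13475.22} \approx 52.439$$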

We have $\chi^2 = 52.439\dots$, which is higher than our target value of $3.841\dots$. Thus, our findings are statistically significant, and we should tell our manager to hurry up and change the color of that button!

In reality, we probably shouldn't be writing this code by hand. Instead, we can use the scipy.stats.chi2_contingency function on our DataFrame to calculate all of this for us (and we can also use it as a check to make sure all of the above code is correct):

In [9]:
scipy.stats.chi2_contingency(df)
Out[9]:
Chi2ContingencyResult(statistic=52.174972351180045, pvalue=5.076909122065231e-13, dof=1, expected_freq=array([[ 1711.21770211, 13934.78229789],
       [ 1654.78229789, 13475.21770211]]))
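
Note that the statistic ($52.175\dots$) is slightly lower than our hand-computed $52.439\dots$. This is because chi2_contingency applies Yates' continuity correction by default when there is one degree of freedom. If we want to reproduce our hand-rolled calculation exactly, we can disable it:

# Disabling Yates' continuity correction reproduces our hand-computed
# statistic of ~52.439.
scipy.stats.chi2_contingency(df, correction=False)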

Looks good enough to me! Also note the pvalue, which says the odds of our results being due to random chance are astronomically low. Turns out people just love blue buttons.