A/B testing with the chi-squared test
A/B testing is used to "test two variants of the same variable". In real-world scenarios, this could be:
- Do we get more sign-ups when the sign-up button is red or blue?
- Should we send our users one notification a day or three to get them to open the app?
- Does adding customer testimonials increase our click-through rate?
There should be exactly one thing that we change, and we should have some way of measuring the outcome (e.g. sign-up buttons clicked, times the app was opened, click-through rate).
Let's say you've set your A/B test up and you've let it run for a while. You now have some data (e.g. the number of sign-ups) both for your current value of the variable (e.g. red sign-up button), which we'll call the "control", and the other value of the variable (e.g. blue sign-up button), which we'll call the "variant".
This is your data:
import pandas as pd
df = pd.DataFrame(
    [
        {"design": "control", "sign_ups": 1513, "page_views": 15646},
        {"design": "variant", "sign_ups": 1853, "page_views": 15130},
    ]
)
df
| | design | sign_ups | page_views |
|---|---|---|---|
| 0 | control | 1513 | 15646 |
| 1 | variant | 1853 | 15130 |
Your manager asks if you should now change the sign-up button to blue. The variant has more sign-ups from fewer page views, so it should be an easy yes, right?
We can check the sign-up rate:
df["signup_rate"] = df["sign_ups"] / df["page_views"]
df
| | design | sign_ups | page_views | signup_rate |
|---|---|---|---|---|
| 0 | control | 1513 | 15646 | 0.096702 |
| 1 | variant | 1853 | 15130 | 0.122472 |
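Before reaching for statistics, we can eyeball the size of the difference (a quick sketch using the DataFrame above):
# Relative lift of the variant's sign-up rate over the control's
lift = df.loc[1, "signup_rate"] / df.loc[0, "signup_rate"] - 1
lift  # ~0.266, i.e. roughly a 27% higher sign-up rate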
The sign-up rate for the blue button is definitely higher, but is it high enough? What if this is all due to random chance, and the people shown the variant would have signed up no matter what color the button was?
This is where the chi-squared test comes in! (Note: there are a few different chi-squared tests; the one we're using is Pearson's chi-squared test.) The chi-squared test is used to evaluate how likely it is that an observed difference between outcomes arose by chance.
First, we need to restructure our data. Instead of page views, we need sign-ups and non-sign-ups. We can then drop the remaining columns (remembering that row index zero is the control and row index one is the variant).
df["not_sign_ups"] = df["page_views"] - df["sign_ups"]
df = df[["sign_ups", "not_sign_ups"]]
df
| | sign_ups | not_sign_ups |
|---|---|---|
| 0 | 1513 | 14133 |
| 1 | 1853 | 13277 |
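As a quick sanity check, each row should still sum to that design's original page views:
df.sum(axis=1)  # 0 -> 15646, 1 -> 15130, matching page_views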
Next, we need to decide on a significance value. This is the threshold we compare against: if the probability that the observed difference arose by chance (rather than from our button change) falls below it, we call our findings statistically significant. If you want to be more sure that your button change actually increased sign-ups, then you want a lower significance value.
You might also hear the terms null hypothesis and alternative hypothesis. The null hypothesis is the scenario where the button change made no significant difference to the number of sign-ups (whereas the alternative hypothesis is the scenario where the button change did cause a significant difference). If the calculated probability that our results are due to chance is lower than our significance value, then we reject the null hypothesis (and thus accept the alternative hypothesis).
Generally, if the switch to the new version costs more resources (time, money, effort, etc.), then you want to be more sure the difference isn't due to chance, and thus use a lower significance value.
A commonly used significance value is 0.05, i.e. we accept at most a 5% (one in twenty) probability that the difference is from random chance.
Why do we pick the significance value now? Why can't we just do all the calculations, get a probability that our change is due to chance, and then decide if it's good enough or not? Choosing the threshold after seeing the results is a form of p-hacking or data dredging: presenting outcomes as statistically significant even when they are not. Ideally, we should declare our predicted outcomes as early as possible (probably before we even start collecting data).
With the significance value, we can find a chi-squared value we need to beat in order to declare our results statistically significant. First, we need to calculate the degrees of freedom which, for the chi-squared test on a contingency table, is given by the code below (intuitively, once the row and column totals are fixed, only one cell of a 2×2 table can vary freely):
n_rows, n_cols = df.shape
degrees_of_freedom = (n_rows - 1) * (n_cols - 1)
degrees_of_freedom
1
scipy has a handy function for calculating our target chi-squared value. (Note: chi2.ppf is the inverse of the cumulative distribution function, so we pass it 1 - significance_value.) We can also get the value from a table, such as this one.
significance_value = 0.05
import scipy.stats
scipy.stats.chi2.ppf(1 - significance_value, df=degrees_of_freedom)
3.841458820694124
So, for our change to be statistically significant, the chi-squared value we compute from our data (the test statistic) needs to be greater than 3.841 (the critical value).
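Equivalently, since isf is the inverse survival function (it takes the tail probability directly), we can skip the subtraction:
scipy.stats.chi2.isf(significance_value, df=degrees_of_freedom)  # same critical value, ~3.841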
For calculating the chi-squared value of our data, we use the formula:
$$\chi^2 = \sum^n_{i=1}\frac{(O_i-E_i)^2}{E_i}$$
$O_i$ is the number of observations of type $i$. In this case, we have four types of observations: number of sign-ups with the control variable (red button), number of non-sign-ups with the control variable, number of sign-ups with the variant variable (blue button), and number of non-sign-ups with the variant variable.
$E_i$ is the expected value of each type $i$, asserted by the null hypothesis.
To get the expected value for each cell, we multiply the sum of that cell's row by the sum of that cell's column, then divide by the total number of observations across all cells.
For example, the expected value for sign-ups using the control variant (top left cell in our DataFrame) is:
$$\frac{(1513 + 14133) \times (1513 + 1853)}{(1513 + 14133 + 1853 + 13277)} = 1711.217\dots$$
How does this actually calculate an expected value for the number of sign-ups for the control variant? It combines two parts: the percentage of people who signed up across both variants, which is:
$$\frac{(1513 + 1853)}{(1513 + 14133 + 1853 + 13277)} = 0.109\dots$$
We can think of this as the "expected sign-up rate". The second part is multiplying the expected sign-up rate by the number of people who visited the control variant, which is $(1513 + 14133) = 15646$, to get our expected sign-ups for the control variant:
$$0.109\dots \times 15646 = 1711.217\dots$$
Similarly, for, e.g., the bottom-right cell, we calculate the percentage of non-sign-ups, $(14133 + 13277) / (1513 + 14133 + 1853 + 13277)$, and then multiply it by the number of visitors to the variant, $(1853 + 13277)$:
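$$\frac{(14133 + 13277)}{(1513 + 14133 + 1853 + 13277)} \times (1853 + 13277) = 0.890\dots \times 15130 = 13475.217\dots$$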
We can implement it in our code like so:
import numpy as np
expected_values = np.zeros((n_rows, n_cols))
denominator = np.sum(df.values)
for i in range(n_rows):
    for j in range(n_cols):
        # expected count = (row total * column total) / grand total
        row_total = df.iloc[i, :].sum()
        col_total = df.iloc[:, j].sum()
        numerator = row_total * col_total
        expected_value = numerator / denominator
        expected_values[i, j] = expected_value
expected_values
array([[ 1711.21770211, 13934.78229789],
       [ 1654.78229789, 13475.21770211]])
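As an aside, we don't need the loops: the whole table of expected values is the outer product of the row totals and the column totals, divided by the grand total. A minimal vectorized sketch:
# Outer product of row totals and column totals, divided by the grand total
np.outer(df.sum(axis=1), df.sum(axis=0)) / df.values.sum()
This produces the same array as above.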
Now that we have the expected values, we can use the rest of the formula to calculate the individual values, which we sum to get the chi-squared value:
values = []
for i in range(n_rows):
    for j in range(n_cols):
        observed_value = df.iloc[i, j]
        expected_value = expected_values[i, j]
        # (O - E)^2 / E for this cell
        value = np.square(observed_value - expected_value) / expected_value
        values.append(value)
chi_squared_value = np.sum(values)
chi_squared_value
52.43919221285117
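If we want the corresponding p-value ourselves, chi2.sf (the survival function, i.e. 1 - CDF) gives the probability of seeing a statistic at least this large under the null hypothesis:
# Probability of a chi-squared value at least this large by chance alone
scipy.stats.chi2.sf(chi_squared_value, df=degrees_of_freedom)
It comes out vanishingly small, far below our significance value of 0.05.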
We have $\chi^2 = 52.439\dots$, which is higher than our critical value of $3.841\dots$. Thus, our findings are statistically significant and we should tell our manager to hurry up and change the color of that button!
In reality, we probably shouldn't be writing this code by hand. Instead, we can use the scipy.stats.chi2_contingency function on our DataFrame to calculate all of this for us (and also use it as a check that all of the above code is correct):
scipy.stats.chi2_contingency(df)
Chi2ContingencyResult(statistic=52.174972351180045, pvalue=5.076909122065231e-13, dof=1, expected_freq=array([[ 1711.21770211, 13934.78229789], [ 1654.78229789, 13475.21770211]]))
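Note that the statistic here (52.175) is slightly lower than our hand-computed 52.439. That's because, for 2×2 tables, chi2_contingency applies Yates' continuity correction by default; passing correction=False reproduces the plain Pearson calculation we did above:
scipy.stats.chi2_contingency(df, correction=False)  # statistic matches our 52.439...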
Looks good enough to me! Also note the pvalue, which says the odds of our results being due to random chance are astronomically low. Turns out people just love blue buttons.