Explaining Selection Bias in Online Experiments with Python

Xinyu Zhang
4 min read · Jan 17, 2021


Recently, I read an article [1] about selection bias in online A/B tests, which describes a simple but intriguing phenomenon: app features usually produce less gain in production than in testing. I found this problem very interesting and decided to dive into it.

An Example
Intuitive Explanation
Hypothesis Test Quick Recap
Explanation with Math and Python
Summary
Reference

An Example

Suppose we have ten new features, and they are independent of each other, which means their effects are additive. Leveraging online A/B tests, we estimate the revenue improvement from each feature, shown in the Observed column of the figure below.

Image from [1]; the “True” column denotes the real improvement of each feature

We only want to ship features that can bring at least 2% improvement during online tests; therefore, we select the features with improvements of 2.7%, 2.6%, and 3.3%. Intuitively, we expect an improvement of 8.6% when these features are shipped. However, as shown in the True column, our real improvement is only 6%.

Intuitive Explanation

We can easily give an intuitive explanation for this problem.

Because we cannot test new features on all users for an arbitrary length of time, results from online tests can be over-estimated or under-estimated. However, since we drop all features whose observed improvement falls below a certain threshold (2% in the example above), we remove most of the under-estimated ones and tend to keep the over-estimated ones.
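A tiny toy simulation makes this concrete (all numbers are made up): every feature here has the same true improvement of 2%, yet selecting on the noisy test results makes the shipped features look better than they really are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Every feature has the same true improvement of 2%, but each online test
# only observes it with noise.
true_improvement = 2.0
observed = true_improvement + rng.normal(0.0, 1.0, size=100_000)

# Keep only the features whose *observed* improvement clears the 2% bar.
shipped = observed[observed >= 2.0]

print(shipped.mean())    # roughly 2.8 -- what the tests promise
print(true_improvement)  # 2.0         -- what production actually delivers
```

Conditioning on the observed value being above the threshold keeps only the positive noise, so the selected observations systematically overshoot the truth.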

In the following, I will give a more rigorous explanation and show that this problem comes entirely from estimation error.

By “entirely”, I mean in this hypothetical and simplified setting.

Hypothesis Test Quick Recap

Before we get into the explanation, I briefly summarize hypothesis testing in the figure below, which will help in understanding the problem.

Hypothesis Test in Six Steps
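As a quick refresher in code, here is a minimal one-sided, one-sample t-test with SciPy against the 2% threshold (the sample data is made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Per-user revenue improvements (%) observed for one feature in an A/B test.
samples = rng.normal(loc=2.5, scale=5.0, size=2_000)

# H0: the mean improvement is <= 2%   vs.   H1: the mean improvement is > 2%
mu = 2.0
t_stat, p_value = stats.ttest_1samp(samples, popmean=mu, alternative="greater")

alpha = 0.05
print(p_value < alpha)  # reject H0 (and ship the feature) if True
```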

Explanation with Math and Python

Let us define the improvements of the features as random variables {Xᵢ}, denote {Xᵢᵒ} as the observed average improvement of each feature (the Observed column in the example), and define μ as the improvement threshold (2% in the example).

Then, with a one-sided t-test, we can define the shippable indicator function as

where n is the number of samples, sᵢ is the standard error of the samples for Xᵢ, and fₜ is the inverse CDF of Student’s t-distribution.
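The indicator can be sketched in code as follows, assuming sᵢ/√n plays the role of the standard error of the observed mean:

```python
import numpy as np
from scipy import stats

def shippable(samples, mu=0.02, alpha=0.05):
    """One-sided t-test: ship feature i only if its observed average
    improvement is significantly larger than the threshold mu."""
    n = len(samples)
    x_bar = np.mean(samples)                   # Xᵢᵒ, the observed average
    s = np.std(samples, ddof=1)                # sample standard deviation
    t_stat = (x_bar - mu) / (s / np.sqrt(n))   # standardized difference
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)  # fₜ, inverse CDF of Student's t
    return t_stat > t_crit
```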

Afterward, we can define the bias between testing and production as

Clearly, we will have Bias > 0 when

Therefore, we can conclude that the selection bias arises when estimation error, caused by a lack of samples, makes Xᵢᵒ pass the hypothesis test even though its true average does not clear the threshold (these are the over-estimated features).
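Under this reading, the bias is the gap between what the online tests promise and what production actually delivers, summed over the shipped features. A minimal helper, with names of my own choosing:

```python
import numpy as np

def selection_bias(observed_means, true_means, shipped):
    """Total improvement promised by the online tests minus the total
    improvement actually realised, summed over the shipped features."""
    observed_means = np.asarray(observed_means)
    true_means = np.asarray(true_means)
    shipped = np.asarray(shipped, dtype=bool)
    return observed_means[shipped].sum() - true_means[shipped].sum()
```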

[1] describes a method to offset the estimation error when the variance of Xᵢ is known.

Python Simulation

To get a more realistic feel, I simulate this selection bias phenomenon with Python and validate the findings above.

First, we define some hyperparameters for the simulation.
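The exact values do not matter much; the ones below are my own choices for this sketch:

```python
# Hyperparameters for the simulation (my own choices for this sketch).
N = 1_000       # number of candidate features
n = 2_000       # number of samples collected per feature in the online test
mu = 0.02       # improvement threshold: ship only if the gain exceeds 2%
alpha = 0.05    # significance level of the one-sided t-test
```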

Then we define the distributions of {Xᵢ} as Gaussian, whose μ and σ are drawn from another distribution, shown below.
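The hyper-prior below is my own assumption; any choice that places some true means on both sides of the 2% threshold will reproduce the effect:

```python
import numpy as np

rng = np.random.default_rng(42)

# Each feature i has a Gaussian improvement distribution Xᵢ ~ N(mu_i, sigma_i).
feature_mu = rng.normal(loc=0.02, scale=0.01, size=N)     # true mean improvements
feature_sigma = rng.uniform(low=0.05, high=0.20, size=N)  # per-user noise level
```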

Afterward, we sample N features, run the t-test, check the Bias > 0 condition as above, and collect bias_features and unbias_features.
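A sketch of that loop, reusing shippable, feature_mu, feature_sigma, and the hyperparameters defined above; here a shipped feature counts as a bias feature when its true mean does not actually clear the threshold:

```python
import numpy as np

bias_features, unbias_features = [], []
estimation_error = np.zeros(N)  # observed mean - true mean, per feature

for i in range(N):
    samples = rng.normal(feature_mu[i], feature_sigma[i], size=n)
    estimation_error[i] = samples.mean() - feature_mu[i]

    if not shippable(samples, mu=mu, alpha=alpha):
        continue  # not shipped, so it cannot contribute to the bias

    # Shipped because of the observed samples, but does the true mean
    # really clear the threshold?
    if feature_mu[i] <= mu:
        bias_features.append(i)
    else:
        unbias_features.append(i)
```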

Next, we calculate the histogram of the estimation error for the bias and unbias features.
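Plotting could look like this (matplotlib, using the estimation_error array and feature lists from the loop above):

```python
import matplotlib.pyplot as plt

plt.hist(estimation_error[bias_features], bins=30, alpha=0.6, label="bias features")
plt.hist(estimation_error[unbias_features], bins=30, alpha=0.6, label="unbias features")
plt.xlabel("estimation error (observed - true improvement)")
plt.ylabel("number of features")
plt.legend()
plt.show()
```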

The estimation error of bias features is significantly higher than that of unbias features

Meanwhile, as we increase the number of samples n, the number of bias features drops.

As we collect more samples, we have fewer bias features

When we set n to 200,000, the number of bias features drops to 0, which aligns with our conclusion above.
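A sketch of that sweep, reusing the helpers above; the exact counts depend on the seed and the assumed hyper-prior:

```python
import numpy as np

def count_bias_features(n, seed=0):
    """Count features that are shipped by the t-test although their
    true mean improvement does not clear the threshold mu."""
    rng = np.random.default_rng(seed)
    count = 0
    for i in range(N):
        samples = rng.normal(feature_mu[i], feature_sigma[i], size=n)
        if shippable(samples, mu=mu, alpha=alpha) and feature_mu[i] <= mu:
            count += 1
    return count

for n_samples in [2_000, 20_000, 200_000]:
    print(n_samples, count_bias_features(n_samples))
```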

Summary

In this post, I talked about an interesting selection bias in online testing and its explanation. The simple takeaway is that any operation performed on your sampled data may bias your conclusions, especially when you do not have enough data. Therefore, we need to be extra careful (and optimistic, of course 😀). I hope this post is helpful to you; see you next time :D

The Python code for this post can be found in this notebook.

Reference

[1] Selection Bias in Online Experimentation. Airbnb Engineering. https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb

