Explaining Selection Bias in Online Experiments with Python
Recently, I read an article [1] about selection bias in online A/B tests, which describes a simple but intriguing phenomenon: app features usually produce less gain in production than they did in testing. I found this problem very interesting and decided to dive into it.
An Example
Intuitive Explanation
Hypothesis Test Quick Recap
Explanation with Math and Python
Summary
Reference
An Example
Suppose we have ten new features that are independent of each other, so their effects are additive. Leveraging online A/B tests, we estimate the revenue improvement from each feature, shown in the Observed column of the figure below.
We only want to ship features that bring at least a 2% improvement during online tests; therefore, we select the three features with observed improvements of 2.7%, 2.6%, and 3.3%. Intuitively, we expect a total improvement of 8.6% once these features are shipped. However, as shown in the True column, the real improvement is only 6%.
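As a quick sanity check of the arithmetic, here is a minimal sketch in Python that uses only the numbers quoted above (the per-feature true values are not reproduced here).

```python
# Observed improvements of the three features that cleared the 2% bar.
shipped_observed = [0.027, 0.026, 0.033]

expected_gain = sum(shipped_observed)  # what the online tests suggest we will gain
true_gain = 0.06                       # the true total improvement in the example

print(f"expected from testing: {expected_gain:.1%}")   # 8.6%
print(f"realized in production: {true_gain:.1%}")      # 6.0%
```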
Intuitive Explanation
We can easily give an intuitive explanation for this problem.
Because we cannot test new features on all users for an arbitrary length of time, results from online tests can be over-estimated or under-estimated. However, since we drop every feature whose measured improvement falls below a certain threshold (2% in the example above), we remove most of the under-estimated features and tend to keep the over-estimated ones.
In the following, I will give a more rigorous explanation and show that this problem comes entirely from estimation error. (By “entirely”, I mean within this hypothetical and simplified setting.)
Hypothesis Test Quick Recap
Before we get into the explanation, let me briefly summarize some basics of hypothesis testing in the figure below, which will help in understanding the problem.
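In a nutshell: for a one-sided, one-sample test of H₀: E[X] ≤ μ against H₁: E[X] > μ, we compute a t-statistic from the sample mean, the sample standard deviation, and the sample size, and reject H₀ when it exceeds the corresponding quantile of Student's t-distribution. As a sketch in my notation, with significance level α:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}, \qquad \text{reject } H_0 \text{ if } t > f_t(1-\alpha,\; n-1),$$

where fₜ(·, n−1) denotes the inverse CDF (quantile function) of the t-distribution with n−1 degrees of freedom.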
Explanation with Math and Python
Let us define the improvements of the features as random variables {Xᵢ}, denote {Xᵢᵒ} as the observed average improvement of each feature (the Observed column in the example), and define μ as the improvement threshold (2% in the example).
Then, with a one-sided t-test, we can define a shippable indicator function for each feature, where n is the number of samples, sᵢ is the standard error of the sample mean of Xᵢ, and fₜ is the inverse CDF (quantile function) of Student's t-distribution.
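As a sketch in these terms, with α denoting the significance level of the test (a symbol I add for notation; the indicator equals 1 when the condition holds and 0 otherwise):

$$\text{ship}(X_i) \;=\; \mathbf{1}\!\left[\frac{X_i^{\,o} - \mu}{s_i} \;>\; f_t(1-\alpha,\; n-1)\right]$$

In words: we ship feature i only if its observed average beats the threshold μ by a statistically significant margin.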
Afterward, we can define the bias between testing and production as the gap between the total observed improvement of the shipped features and their total true improvement.
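One way to write this in the notation above (a sketch; E[Xᵢ] is the true mean improvement of feature i):

$$\text{Bias} \;=\; \sum_{i} \text{ship}(X_i)\,\bigl(X_i^{\,o} - \mathbb{E}[X_i]\bigr)$$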
Clearly, we will have Bias > 0 when over-estimated features slip into the shipped set: their observed average passes the hypothesis test even though their true average would not.
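In symbols (again a sketch using the notation above), this happens for a shipped feature i when

$$\frac{X_i^{\,o} - \mu}{s_i} \;>\; f_t(1-\alpha,\; n-1) \qquad \text{while} \qquad \frac{\mathbb{E}[X_i] - \mu}{s_i} \;\le\; f_t(1-\alpha,\; n-1).$$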
Therefore, we can conclude that the selection bias appears when estimation error, caused by a lack of samples, lets Xᵢᵒ pass the hypothesis test while the true average of Xᵢ would not. (These are the over-estimated features.)
[1] describes a method to offset the estimation error when the variance of Xᵢ is known.
Python Simulation
To get a more concrete feel, I simulate this selection bias phenomenon with Python and validate the conclusions above.
First, we define some hyperparameters for the simulation.
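Here is a minimal sketch of the setup; the names and values (N_FEATURES, N_SAMPLES, THRESHOLD, ALPHA, the seed) are illustrative choices rather than the exact ones in the accompanying notebook.

```python
import numpy as np
from scipy import stats

# Hyperparameters for the simulation (values are illustrative).
N_FEATURES = 1000   # number of simulated features
N_SAMPLES = 2000    # samples collected per feature in the online test
THRESHOLD = 0.02    # minimum improvement (mu) required to ship a feature
ALPHA = 0.05        # significance level of the one-sided t-test

rng = np.random.default_rng(42)
```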
Then we define the distribution of each Xᵢ as Gaussian, with its mean and standard deviation drawn from other distributions, as shown below.
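A sketch of one possible choice; the particular prior distributions and their parameters here are illustrative assumptions.

```python
# Each feature's improvement X_i ~ Normal(true_mu[i], true_sigma[i]).
# The per-feature parameters are themselves drawn from prior distributions.
true_mu = rng.normal(loc=0.02, scale=0.01, size=N_FEATURES)   # true mean improvement
true_sigma = rng.uniform(0.05, 0.20, size=N_FEATURES)         # per-sample noise level
```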
Afterward, we sample data for each of the N features, run the t-test, check the Bias > 0 condition as above, and collect bias_features and unbias_features.
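A sketch of this step, continuing with the variables defined above: a feature lands in bias_features when its observed mean passes the test but its true mean would not, which is exactly the condition from the previous section.

```python
bias_features, unbias_features = [], []
obs_mean = np.empty(N_FEATURES)
t_crit = stats.t.ppf(1 - ALPHA, df=N_SAMPLES - 1)        # f_t(1 - alpha, n - 1)

for i in range(N_FEATURES):
    samples = rng.normal(true_mu[i], true_sigma[i], size=N_SAMPLES)
    obs_mean[i] = samples.mean()
    se = samples.std(ddof=1) / np.sqrt(N_SAMPLES)        # standard error s_i

    if (obs_mean[i] - THRESHOLD) / se > t_crit:          # observed mean passes the test
        if (true_mu[i] - THRESHOLD) / se > t_crit:       # true mean would also pass
            unbias_features.append(i)
        else:                                            # over-estimated feature
            bias_features.append(i)

print(len(bias_features), len(unbias_features))
```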
Next, we plot the histogram of the estimation error for the biased and unbiased features.
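Here, estimation error means the observed mean minus the true mean; the plotting code below is a sketch that reuses the arrays from the previous snippet.

```python
import matplotlib.pyplot as plt

est_error = obs_mean - true_mu                            # estimation error per feature

plt.hist(est_error[bias_features], bins=30, alpha=0.6, label="bias_features")
plt.hist(est_error[unbias_features], bins=30, alpha=0.6, label="unbias_features")
plt.xlabel("observed mean - true mean")
plt.ylabel("count")
plt.legend()
plt.show()
```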
Meanwhile, as we increase the number of samples, the number of biased features drops. When we set n to 200,000, the number of biased features goes to 0, which aligns with our conclusion above.
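To check this, we can wrap the simulation in a small helper and sweep the sample size; the sketch below uses the illustrative parameters above, so the exact counts will depend on those choices.

```python
def count_bias_features(n_samples):
    """Count features whose observed mean passes the t-test while their true mean would not."""
    t_crit = stats.t.ppf(1 - ALPHA, df=n_samples - 1)
    n_bias = 0
    for mu_i, sigma_i in zip(true_mu, true_sigma):
        samples = rng.normal(mu_i, sigma_i, size=n_samples)
        se = samples.std(ddof=1) / np.sqrt(n_samples)
        if (samples.mean() - THRESHOLD) / se > t_crit and (mu_i - THRESHOLD) / se <= t_crit:
            n_bias += 1
    return n_bias

for n in (2_000, 20_000, 200_000):
    print(n, count_bias_features(n))
```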
Summary
In this post, I talked about an interesting selection bias in online testing and its explanation. The simple takeaway is that any selection you perform on sampled data may bias your conclusions, especially when you do not have enough data. Therefore, we need to be extra careful (and optimistic, of course 😀). I hope this post is helpful to you. See you next time :D
The Python code for this post can be found in this notebook.
Reference
[1]: Selection Bias in Online Experimentation https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb