# Explaining Selection Bias in Online Experiments with Python

Recently, I read an article about selection bias in online A/B tests. It describes a simple but intriguing phenomenon: app features usually produce less gain in production than in testing. I found this problem very interesting and decided to dive into it.

# An Example

Suppose we have ten new features that are independent of each other, which means their effects are additive. Leveraging online A/B tests, we estimate the revenue improvement from each feature, shown in the `Observed` column of the figure below (image from the article mentioned above); the `True` column denotes the real improvement of each feature.

We only want to ship features that can bring at least 2% improvement during online tests; therefore, we select the features with improvements of 2.7%, 2.6%, and 3.3%. Intuitively, we expect an improvement of 8.6% when these features are shipped. However, as shown in the `True` column, our real improvement is only 6%.
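The arithmetic of the example can be checked in a few lines; the three observed values and the 6% realized total are taken from the figure:

```python
# Observed improvements of the three shipped features (from the example figure)
observed_shipped = [0.027, 0.026, 0.033]

expected = sum(observed_shipped)   # what we intuitively expect in production: 8.6%
true_total = 0.06                  # the realized improvement from the `True` column

bias = expected - true_total       # the gap this article sets out to explain
print(f"expected={expected:.3f}, true={true_total:.3f}, bias={bias:.3f}")
```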

# Intuitive Explanation

We can easily give an intuitive explanation for this problem.

Because we cannot test new features on all users for an arbitrary length of time, results from online tests can be over-estimated or under-estimated. However, since we discard all features whose observed improvement is below a certain threshold (2% in the above example), we remove most of the under-estimated ones and tend to keep the over-estimated ones.

In the following, I will give a more rigorous explanation and show that this problem comes entirely from estimation error.

By “entirely”, I mean in this hypothetical and simplified setting.

# Hypothesis Test Quick Recap

Before we get into the explanation, the figure below briefly summarizes the relevant background on hypothesis testing, which helps in understanding the problem.

# Explanation with Math and Python

Let us define the improvement of feature i as a random variable Xᵢ, denote Xᵢᵒ as the observed average improvement of that feature (`Observed` in the example), and define μ as the improvement threshold (2% in the example).

Then, with a one-sided t-test, we can define the shippable indicator function as

Iᵢ = 1 if (Xᵢᵒ − μ) / sᵢ > fₜ(1 − α; n − 1), else Iᵢ = 0,

where n is the number of samples, sᵢ stands for the standard error of the samples for Xᵢ, fₜ represents the inverse of the CDF of Student's t-distribution, and α is the significance level.
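This indicator can be sketched in Python with SciPy; the function name, default threshold, and default α below are my choices, not the article's:

```python
import numpy as np
from scipy import stats

def shippable(samples, threshold=0.02, alpha=0.05):
    """One-sided one-sample t-test against the shipping threshold.

    Ship the feature when the observed mean improvement is significantly
    above `threshold`, i.e. when the t-statistic exceeds the (1 - alpha)
    quantile of Student's t-distribution with n - 1 degrees of freedom.
    """
    n = len(samples)
    se = np.std(samples, ddof=1) / np.sqrt(n)        # standard error of the mean
    t = (np.mean(samples) - threshold) / se
    return t > stats.t.ppf(1 - alpha, df=n - 1)
```

For example, a feature whose samples average well above 2% passes, while one averaging near zero does not.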

Afterward, we can define the bias between testing and production as

Bias = Σᵢ Iᵢ · (Xᵢᵒ − E[Xᵢ]),

where Iᵢ is the shippable indicator from the t-test above: the gap between what the shipped features show in testing and what they deliver in production.

Clearly, we will have Bias > 0 when, for some features,

(Xᵢᵒ − μ) / sᵢ > fₜ(1 − α; n − 1) while (E[Xᵢ] − μ) / sᵢ ≤ fₜ(1 − α; n − 1),

Therefore, we can conclude that the selection bias arises when estimation error, caused by a lack of samples, lets Xᵢᵒ pass the hypothesis test even though the true average E[Xᵢ] would not pass it. (These are the over-estimated features.)

The article also describes a method to offset the estimation error when the variance of Xᵢ is known.

# Python Simulation

To get a more realistic feel, I simulate this selection bias phenomenon with Python and validate our findings.

First, we define some hyper-parameters for the simulation.
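The original code snippet did not survive; a minimal sketch of plausible hyper-parameters, where every name and value is my assumption:

```python
# Hypothetical hyper-parameters for the simulation (names and values are assumptions)
N_FEATURES = 5000   # number of simulated features
N_SAMPLES = 100     # observations per feature collected in the online test
THRESHOLD = 0.02    # minimum improvement worth shipping (2%)
ALPHA = 0.05        # significance level for the one-sided t-test
```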

Then we define the distribution of each Xᵢ as Gaussian, whose mean μᵢ and standard deviation σᵢ are drawn from another distribution, as below.

Afterward, we sample N features, run the t-test, check the Bias > 0 condition as above, and collect `bias_features` and `unbias_features`.
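A self-contained sketch of that loop; it repeats the assumed hyper-parameters and hyper-distributions so it runs on its own, and counts a feature as biased when it ships although its true mean is at or below the threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N_FEATURES, N_SAMPLES, THRESHOLD, ALPHA = 5000, 100, 0.02, 0.05
t_crit = stats.t.ppf(1 - ALPHA, df=N_SAMPLES - 1)

bias_features, unbias_features = [], []   # (observed mean, true mean) pairs
for _ in range(N_FEATURES):
    mu = rng.normal(0.01, 0.02)           # true improvement (assumed hyper-dist)
    sigma = rng.uniform(0.05, 0.15)       # per-user noise (assumed hyper-dist)
    x = rng.normal(mu, sigma, size=N_SAMPLES)
    t = (x.mean() - THRESHOLD) * np.sqrt(N_SAMPLES) / x.std(ddof=1)
    if t > t_crit:                        # the feature ships
        # "biased" = ships even though its true mean is not above the threshold
        (bias_features if mu <= THRESHOLD else unbias_features).append((x.mean(), mu))
```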

Next, we calculate the histogram of estimation error for the biased and unbiased features. As the figure shows, the estimation error of the biased features is significantly higher than that of the unbiased ones.
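The original plot is not reproduced here; the following standalone sketch re-runs the same assumed simulation and compares the average estimation error (observed minus true mean) of the two groups instead of plotting histograms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N_FEATURES, N_SAMPLES, THRESHOLD, ALPHA = 5000, 100, 0.02, 0.05
t_crit = stats.t.ppf(1 - ALPHA, df=N_SAMPLES - 1)

errors_bias, errors_unbias = [], []       # estimation error = observed - true
for _ in range(N_FEATURES):
    mu = rng.normal(0.01, 0.02)           # assumed hyper-distributions as before
    sigma = rng.uniform(0.05, 0.15)
    x = rng.normal(mu, sigma, size=N_SAMPLES)
    if (x.mean() - THRESHOLD) * np.sqrt(N_SAMPLES) / x.std(ddof=1) > t_crit:
        (errors_bias if mu <= THRESHOLD else errors_unbias).append(x.mean() - mu)

print(f"mean error, biased:   {np.mean(errors_bias):.4f}")
print(f"mean error, unbiased: {np.mean(errors_unbias):.4f}")
```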

Meanwhile, as we increase the number of samples n, the number of biased features drops.
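That trend can be checked with a sketch like the following (same assumed distributions, vectorized over features); with more samples per feature, the standard error shrinks and fewer under-performing features slip through the test:

```python
import numpy as np
from scipy import stats

def count_biased(n_samples, n_features=5000, thr=0.02, alpha=0.05, seed=0):
    """Count features that ship although their true mean is <= thr."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.01, 0.02, size=n_features)      # assumed hyper-distribution
    sigma = rng.uniform(0.05, 0.15, size=n_features)  # assumed hyper-distribution
    x = rng.normal(mu, sigma, size=(n_samples, n_features))
    t = (x.mean(axis=0) - thr) * np.sqrt(n_samples) / x.std(axis=0, ddof=1)
    ship = t > stats.t.ppf(1 - alpha, df=n_samples - 1)
    return int(np.sum(ship & (mu <= thr)))

counts = [count_biased(n) for n in (50, 200, 800)]
print(counts)   # biased-feature count for each sample size
```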