# Explaining Selection Bias in Online Experiments with Python


Recently, I read an article [1] about selection bias in online A/B tests, which describes a simple but intriguing phenomenon: app features usually produce **less gain in production than in testing**. I found this problem very interesting and decided to dive into it.

- An Example
- Intuitive Explanation
- Hypothesis Test Quick Recap
- Explanation with Math and Python
- Summary
- Reference

# An Example

Suppose we have ten new features that are independent of each other, which means their effects are additive. Leveraging online A/B tests, we estimate the revenue improvement from each feature, shown in the `Observed` column of the figure below.

We only want to ship features that bring at least a 2% improvement during online tests; therefore, we select the features with observed improvements of 2.7%, 2.6%, and 3.3%. Intuitively, we expect a total improvement of 8.6% when these features are shipped. However, as shown in the `True` column, the real improvement is only 6%.

# Intuitive Explanation

We can easily give an intuitive explanation for this problem.

Because we cannot test new features on all users for an arbitrary length of time, results from online tests can be over-estimated or under-estimated. However, since we filter out all features whose observed improvement is below a certain threshold (2% in the above example), we remove most of the under-estimated ones and tend to keep the over-estimated ones.

In the following, I will give a more rigorous explanation and show that this problem comes entirely from estimation error. (By “entirely”, I mean within this hypothetical and simplified setting.)

# Hypothesis Test Quick Recap

Before we get into the explanation, I briefly summarize some knowledge of hypothesis testing in the figure below, which helps in understanding the problem.
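As a quick, concrete refresher, here is a minimal one-sided, one-sample t-test in Python (the data and threshold below are made up for illustration, not taken from the original figure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user revenue improvements (in %) for one feature
samples = rng.normal(loc=2.5, scale=5.0, size=1000)

mu = 2.0  # H0: true mean improvement <= mu; H1: it is greater
t_stat, p_value = stats.ttest_1samp(samples, popmean=mu, alternative="greater")

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# We reject H0 (and would ship the feature) when p < alpha, e.g. alpha = 0.05
```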

# Explanation with Math and Python

Let us define the improvements of the features as random variables *{Xᵢ}*, denote *Xᵢᵒ* as the observed average improvement of each feature (the `Observed` column in the example), and define *μ* as the improvement threshold (2% in the example).

Then, with a one-sided t-test at significance level *α*, we can define the shippable indicator function as

*ship(Xᵢ) = 𝟙[(Xᵢᵒ − μ) / sᵢ > fₜ(1 − α; n − 1)]*

where *n* is the number of samples, *sᵢ* is the standard error of the samples of *Xᵢ*, and *fₜ(·; n − 1)* is the inverse CDF of *Student’s t-distribution* with *n − 1* degrees of freedom.
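In Python, this indicator can be sketched as follows (the function and variable names are mine, not from the original notebook):

```python
import numpy as np
from scipy import stats

def shippable(samples: np.ndarray, mu: float, alpha: float = 0.05) -> bool:
    """One-sided t-test: ship when the observed mean improvement
    exceeds the threshold mu at significance level alpha."""
    n = len(samples)
    s_i = samples.std(ddof=1) / np.sqrt(n)   # standard error s_i
    t_stat = (samples.mean() - mu) / s_i     # (X_i^o - mu) / s_i
    return bool(t_stat > stats.t.ppf(1 - alpha, df=n - 1))

rng = np.random.default_rng(1)
print(shippable(rng.normal(3.0, 2.0, size=500), mu=2.0))
```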

Afterward, we can define the bias between testing and production as

*Bias = Σᵢ ship(Xᵢ) · (Xᵢᵒ − E[Xᵢ])*

where *E[Xᵢ]* is the true average improvement of feature *i*.

Clearly, we will have *Bias > 0* when

*Σᵢ ship(Xᵢ) · Xᵢᵒ > Σᵢ ship(Xᵢ) · E[Xᵢ]*

Therefore, we can conclude that the selection bias arises when estimation error, caused by a lack of samples, makes *Xᵢᵒ* pass the hypothesis test while the true average *E[Xᵢ]* would not. (These are the over-estimated features.)

[1] describes a method to offset the estimation error when the variance of *Xᵢ* is known.

**Python Simulation**

To get a more concrete feel, I simulate this selection-bias phenomenon with Python and validate our findings.

First, we define some hyperparameters for the simulation.
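The original notebook is not reproduced here; a plausible set of hyperparameters (the names and values are my assumptions) might be:

```python
N = 1000      # number of candidate features to simulate
n = 1000      # number of samples (users) per feature
mu = 0.02     # shipping threshold: 2% improvement
alpha = 0.05  # significance level of the one-sided t-test
```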

Then we define the distributions of *{Xᵢ}* as Gaussian, whose true mean and standard deviation are themselves drawn from another distribution.
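A sketch of this two-level sampling (the hyper-distributions below are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_feature(rng, n=1000):
    """Draw one feature's true improvement distribution, then sample
    n observed per-user improvements from it."""
    true_mean = rng.normal(loc=0.0, scale=0.03)  # true average improvement
    sigma = rng.uniform(0.05, 0.3)               # per-user noise scale
    samples = rng.normal(true_mean, sigma, size=n)
    return true_mean, samples

true_mean, samples = draw_feature(rng)
```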

Afterward, we sample *N* features, run the t-test, check the *Bias > 0* condition as above, and collect `bias_features` and `unbias_features`.
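Putting the pieces together, a self-contained sketch of the whole simulation (the distributions, seed, and names are my assumptions, not the original notebook's) could look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, n, mu, alpha = 1000, 1000, 0.02, 0.05
t_crit = stats.t.ppf(1 - alpha, df=n - 1)

bias_features, unbias_features = [], []  # estimation errors of shipped features
for _ in range(N):
    true_mean = rng.normal(0.0, 0.03)
    sigma = rng.uniform(0.05, 0.3)
    samples = rng.normal(true_mean, sigma, size=n)

    se = samples.std(ddof=1) / np.sqrt(n)
    if (samples.mean() - mu) / se > t_crit:    # feature is shipped
        error = samples.mean() - true_mean     # estimation error
        if true_mean > mu:
            unbias_features.append(error)      # truly above the threshold
        else:
            bias_features.append(error)        # shipped only via over-estimation

print(len(bias_features), len(unbias_features))
```

Note that every feature in `bias_features` is necessarily over-estimated: its sample mean cleared the threshold while its true mean did not.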

Next, we calculate the histogram of the estimation error for the biased and unbiased features.
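The histograms can be computed with NumPy; for a self-contained example, the stand-in error lists below merely mimic the shape of `bias_features` and `unbias_features` (in the real simulation you would use those lists directly):

```python
import numpy as np

# Stand-in estimation errors (observed mean minus true mean):
# biased features are always over-estimated, unbiased ones are centered at zero.
rng = np.random.default_rng(0)
bias_errors = np.abs(rng.normal(0.01, 0.004, size=50))
unbias_errors = rng.normal(0.0, 0.005, size=400)

bins = np.linspace(-0.05, 0.05, 51)
bias_hist, _ = np.histogram(bias_errors, bins=bins)
unbias_hist, _ = np.histogram(unbias_errors, bins=bins)
```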

Meanwhile, as we increase the number of samples *n*, the number of biased features drops. When we set *n* to 200,000, the number of biased features goes to 0, which aligns with our conclusion above.
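A sketch of this sweep (at smaller scales than the post's 200,000 to keep it quick; the distributions and names are my assumptions):

```python
import numpy as np
from scipy import stats

def count_bias_features(n, N=500, mu=0.02, alpha=0.05, seed=42):
    """Count shipped features whose true mean is actually below mu."""
    rng = np.random.default_rng(seed)
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)
    biased = 0
    for _ in range(N):
        true_mean = rng.normal(0.0, 0.03)
        sigma = rng.uniform(0.05, 0.3)
        samples = rng.normal(true_mean, sigma, size=n)
        se = samples.std(ddof=1) / np.sqrt(n)
        if (samples.mean() - mu) / se > t_crit and true_mean <= mu:
            biased += 1
    return biased

for n in (100, 1000, 10000):
    print(n, count_bias_features(n))
```

With more samples per feature, the standard error shrinks, so fewer below-threshold features can fluke past the test.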

# Summary

In this post, I talk about an interesting selection bias in online testing and its explanation. The **simple takeaway** is that any selection you apply to sampled data may bias the conclusion, especially when you do not have enough data. Therefore, we need to be extra careful (and optimistic, of course 😀). I hope this post is helpful to you. See you next time :D

The python code for this post can be found in this notebook.

# Reference

[1] Selection Bias in Online Experimentation. https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb