Intuitive Explanation of Effective Sample Size in Importance Sampling

Importance Sampling is a widely used technique in machine learning algorithms. For example, nearly all policy gradient methods rely on important sampling to reuse transitions from old episodes. The concept of Important Sampling is straightforward:

  • Suppose we have samples from a distribution q(x)
  • But we want to calculate the expectation of f(x) under a different distribution p(x), and for efficiency reasons, we want to reuse existing samples
  • Then we transform the expectation over p(x) to q(x) like below

However, there is no magic in the world. The quality of the above approximation will depend on the p(x), q(x), and n. If p(x) and q(x) are very different, then the quality will be poor, or we need a large n to compensate.

Recently, I read about effective sample size, a technique to measure the important sampling quality. I didn’t understand it at first. But with some effort, I now finally figure it out. I write this post to record and share my understanding with you. This post depends on [1], which I recommend for further reading.



Use Variance to Measure Quality

The first and apparent idea is to use variance to measure quality. Let us denote μ as the Eₚ(f(x)), a random variable μ_q as the approximation:

We can see μ_q is an unbiased estimator of μ. And we want to calculate var(μ_q) to represent the quality. Suppose X₁, … Xᵢ are i.i.d. to each other, we can derive it like below:

The larger the variance is, the worse the quality we expect.

However, this measurement will be influenced by the variance of f(X) itself.
Let’s denote σ² as the variance of f(X). If σ² is large, we will get a large variance even the important sampling is good enough.

Reformulate the Variance

The idea is to separate the influence of important sampling out of the var(μ_q). It turns out we can reformulate Eₚ(f(x)) as follows:

The above formulation seems to be overkill for our problem, but it can be applied in a more general case called self-normalized important sampling, which we won’t elaborate on here.

Then with this formulation, we can derive var(μ_q) again

We can separate the original variance out of the var(μ_q) from the following equation.

Effective Sample Size

Finally, we can call out the effective sample size, which we denote as nₑ.

We can understand nₑ by first thinking about the scenario when p(x) equals q(x). In that perfect case, we literally sample from p(x); therefore, the σ_q equals σ, and all samples are effective.

Then go back to the normal case, we can imagine the condition that the variance of the normal case equals that of the perfect one, like below:

It means under current important sampling settings, we will have the same variance if we draw nₑ samples directly from p(x). How wonderful!

End

In this post, I briefly explain the concept of effective sample size in Important Sampling. I would say it is simple and elegant. I hope this post is useful to you. See you soon 😄

Reference

[1] Importance sampling

SDE @ Microsoft

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store