Importance sampling is a widely used technique in machine learning. For example, many policy gradient methods rely on importance sampling to reuse transitions from old episodes. The concept of importance sampling is straightforward:
- Suppose we have samples from a distribution q(x)
- But we want to calculate the expectation of f(x) under a different distribution p(x), and for efficiency reasons, we want to reuse existing samples
- Then we transform the expectation over p(x) to q(x) like below
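Concretely, writing $w(x) = p(x)/q(x)$ for the importance weight:

$$
\mathbb{E}_p[f(X)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q[f(X)\,w(X)] \approx \frac{1}{n}\sum_{i=1}^{n} f(X_i)\,w(X_i), \quad X_i \sim q
$$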
However, there is no magic in the world. The quality of the above approximation depends on p(x), q(x), and n. If p(x) and q(x) are very different, the quality will be poor, or we will need a large n to compensate.
Recently, I read about effective sample size, a technique to measure importance sampling quality. I didn't understand it at first, but with some effort I finally figured it out. I am writing this post to record and share my understanding with you. This post draws on the reference listed at the end, which I recommend for further reading.
Use Variance to Measure Quality
The first and most obvious idea is to use variance to measure quality. Let us denote μ = Eₚ[f(x)], and define a random variable μ_q as its approximation:
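That is, with $w(x) = p(x)/q(x)$:

$$
\mu = \mathbb{E}_p[f(X)], \qquad \mu_q = \frac{1}{n}\sum_{i=1}^{n} f(X_i)\,w(X_i), \quad X_i \sim q
$$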
We can see that μ_q is an unbiased estimator of μ, and we want to calculate var(μ_q) to represent the quality. Supposing X₁, …, Xₙ are i.i.d. samples from q, we can derive it as below:
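Using independence and the unbiasedness $\mathbb{E}_q[f(X)\,w(X)] = \mu$:

$$
\mathrm{var}(\mu_q) = \frac{1}{n}\,\mathrm{var}_q\!\big(f(X)\,w(X)\big) = \frac{1}{n}\Big(\mathbb{E}_q\big[f(X)^2\,w(X)^2\big] - \mu^2\Big)
$$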
The larger the variance is, the worse the quality we expect.
However, this measurement is influenced by the variance of f(X) itself.
Let's denote σ² as the variance of f(X) under p. If σ² is large, we will get a large variance even if the importance sampling is good.
Reformulate the Variance
The idea is to separate the influence of importance sampling out of var(μ_q). It turns out we can reformulate Eₚ(f(x)) as follows:
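Since $\mathbb{E}_q[w(X)] = \int p(x)\,dx = 1$, the expectation can be written as a ratio:

$$
\mathbb{E}_p[f(X)] = \mathbb{E}_q[f(X)\,w(X)] = \frac{\mathbb{E}_q[f(X)\,w(X)]}{\mathbb{E}_q[w(X)]}
$$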
The above formulation seems to be overkill for our problem, but it also applies in a more general case called self-normalized importance sampling, which we won't elaborate on here.
Then, with this formulation, we can derive var(μ_q) again:
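Treating μ_q as the corresponding ratio (self-normalized) estimator and applying the delta method for large $n$, as in the standard effective-sample-size derivation:

$$
\mathrm{var}(\mu_q) \approx \frac{1}{n}\,\mathbb{E}_q\!\big[w(X)^2\,\big(f(X) - \mu\big)^2\big]
$$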
We can then separate the original variance σ² out of var(μ_q), as in the following equation.
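Under the usual simplification that the weight $w(X)$ is roughly uncorrelated with $(f(X)-\mu)^2$, and using $\mathbb{E}_q[w(X)] = 1$:

$$
\mathrm{var}(\mu_q) \approx \frac{\sigma^2}{n}\,\mathbb{E}_q\big[w(X)^2\big] = \frac{\sigma^2}{n}\,\big(1 + \mathrm{var}_q(w(X))\big)
$$

Here σ² appears as a separate factor, while $\mathbb{E}_q[w(X)^2]$ captures only the mismatch between p and q.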
Effective Sample Size
Finally, we can introduce the effective sample size, which we denote as nₑ.
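In population and sample form:

$$
n_e = \frac{n}{\mathbb{E}_q[w(X)^2]} = \frac{n}{1 + \mathrm{var}_q(w(X))}, \qquad \hat{n}_e = \frac{\big(\sum_{i=1}^{n} w(X_i)\big)^2}{\sum_{i=1}^{n} w(X_i)^2}
$$

where the second expression is the commonly used empirical estimate computed from the drawn samples.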
We can understand nₑ by first considering the scenario where p(x) equals q(x). In that perfect case, we are literally sampling from p(x); therefore σ_q equals σ, and all samples are effective.
Now, going back to the general case, we can ask under what condition the variance of the general case equals that of the perfect one, as below:
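Setting the approximate importance sampling variance equal to the variance of a direct Monte Carlo estimate with $n_e$ samples from p(x):

$$
\frac{\sigma^2}{n}\,\mathbb{E}_q\big[w(X)^2\big] = \frac{\sigma^2}{n_e} \quad\Longrightarrow\quad n_e = \frac{n}{\mathbb{E}_q[w(X)^2]}
$$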
It means that under the current importance sampling setting, we would get the same variance if we drew nₑ samples directly from p(x). How wonderful!
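As a small numerical sketch (my own illustration with hypothetical Gaussian choices for p and q, not taken from the reference), we can compute the empirical estimate n̂ₑ = (Σwᵢ)² / Σwᵢ² from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Proposal q = N(0, 1); target p = N(1, 1). We draw from q but want
# expectations under p, so each sample gets weight w = p(x) / q(x).
n = 10_000
x = rng.normal(0.0, 1.0, size=n)

def log_gauss(x, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Compute weights in log space for numerical stability.
w = np.exp(log_gauss(x, 1.0, 1.0) - log_gauss(x, 0.0, 1.0))

# Empirical effective sample size: (sum w)^2 / sum(w^2).
ess = w.sum() ** 2 / (w @ w)

# For this p/q pair, E_q[w^2] = e, so in theory n_e = n / e ≈ 3679;
# the empirical estimate should land close to that.
print(f"n = {n}, effective sample size ≈ {ess:.0f}")
```

Shrinking the gap between p and q pushes the estimate toward n, and widening it collapses the effective sample size toward 1, matching the intuition above.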
In this post, I briefly explained the concept of effective sample size in importance sampling. I would say it is simple and elegant. I hope this post is useful to you. See you soon 😄
Reference: Importance sampling, https://www.math.arizona.edu/~tgk/mc/book_chap6.pdf