Dive into Latent Dirichlet Allocation

Xinyu Zhang
6 min read · Nov 13, 2020

· Introduction
· Key Questions to Understand LDA
Why Dirichlet Distribution?
Why is Exact Inference intractable?
How does Variational Inference make it tractable?
· Summary
· Reference

In this post, we will look at Latent Dirichlet Allocation (LDA). LDA was proposed in [1] in 2003 and was widely used in industry for topic modeling and recommendation systems before the deep learning boom.

This post is not a tutorial on using LDA in practice; for that, I recommend getting your hands dirty with this Python library [2]. Instead, I will first give a brief introduction and then answer some crucial questions for a deeper understanding of LDA.
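For a quick taste, here is a minimal sketch of what fitting a model with [2] might look like. It assumes the library's scikit-learn-style interface (lda.LDA(n_topics, n_iter, random_state), .fit() on an integer document-term matrix, and the doc_topic_/topic_word_ attributes); check the project page for the current API.

```python
import numpy as np
import lda  # the topic-modeling library from [2]

# Toy document-term count matrix: 5 documents over a 6-word vocabulary.
X = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 4, 1, 0, 0, 0],
    [0, 0, 3, 4, 0, 1],
    [0, 1, 2, 3, 0, 0],
    [1, 0, 0, 0, 4, 3],
])

model = lda.LDA(n_topics=2, n_iter=500, random_state=0)
model.fit(X)                # fits the model (this library uses collapsed Gibbs sampling)

print(model.doc_topic_)     # per-document topic distributions (the θ's)
print(model.topic_word_)    # per-topic word distributions (the rows of β)
```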

Topic distribution assigned by LDA to a document

Introduction

LDA is a three-level hierarchical Bayesian model that models a document as being generated word by word from a vocabulary. Like other probabilistic methods, LDA defines a prior distribution; it describes this prior through a generative process, whose parameters can be learned with maximum a posteriori estimation.

The Generative Process

The generative process is straightforward and fun to learn. For each document w in a corpus D, LDA assumes the following three levels of steps:

  • Level 1 (Come up with a subject): Sample a topic distribution θ from Dir(α), where α is the Dirichlet parameter.
  • Level 2 (Find a topic under the subject): Sample a topic z from the multinomial topic distribution θ.
  • Level 3 (Write down a word from the topic): Sample a word w from P(w|z), where we denote βᵢⱼ as the multinomial parameter P(wⱼ=1 | zᵢ=1).

Levels 2 and 3 repeat until the number of words reaches the given document length. Intuitively, this hierarchy gives LDA the ability to cluster words into topics and assign each document its own topic distribution, which was quite ahead of its time (compared to tf-idf [6]).
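To make the three levels concrete, here is a small NumPy simulation of this generative process for a single document; the topic count, vocabulary size, and the β matrix below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_len = 3, 8, 10
alpha = np.full(n_topics, 0.5)                  # Dirichlet parameter α
# β[i, j] = P(word j | topic i): one multinomial per topic (made-up here)
beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)

# Level 1: sample the document's topic distribution θ ~ Dir(α)
theta = rng.dirichlet(alpha)

document = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)           # Level 2: topic z ~ Mult(θ)
    w = rng.choice(vocab_size, p=beta[z])       # Level 3: word w ~ P(w|z) = Mult(β_z)
    document.append(w)

print(theta)     # the document's topic mixture
print(document)  # word indices of the generated document
```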

You may wonder what a topic is. On the surface, a topic is just a latent variable whose symbolic role is to hold this hierarchical model together. However, I prefer to think of a topic as a multinomial word distribution over the vocabulary.

Applications

LDA is widely used in recommendation systems. We can use the topic distribution of each document (or item) as a feature to compute item similarity and recommend similar items to customers. We can also model users as documents and items as words to do user-based collaborative filtering. In practice, an item can be a piece of news, a web page, a product description, and more.
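As a small illustration of the recommendation use case, the sketch below scores item similarity by the cosine similarity of their topic distributions; the doc_topic matrix and the item indices are hypothetical placeholders (in practice you would take them from a fitted model, e.g. model.doc_topic_).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-item topic distributions (rows sum to 1).
doc_topic = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])

query = 0  # recommend items similar to item 0
scores = [cosine_similarity(doc_topic[query], doc_topic[d])
          for d in range(len(doc_topic))]
print(np.argsort(scores)[::-1])  # item 1 ranks right after item 0 itself
```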

Key Questions to Understand LDA

In the following, I will go through some critical questions to understand LDA better.

Why Dirichlet Distribution?

Assigning each document a topic distribution (a score vector) that summarizes what it is about is a fantastic and intuitive idea. Nevertheless, to do so, we need to model the distribution of those topic distributions, which is why we use the Dirichlet distribution.

The Dirichlet density function with parameters α₁, α₂, …, αᵣ, where αₖ > 0 and N = ∑ₖ αₖ, k ∈ {1, 2, …, r}, is

p(θ₁, …, θᵣ | α) = ( Γ(N) / ∏ₖ Γ(αₖ) ) · θ₁^(α₁−1) ⋯ θᵣ^(αᵣ−1),   with θₖ ≥ 0 and ∑ₖ θₖ = 1.

Therefore, each sample [θ₁,…,θᵣ] of Dir is a valid multinomial distribution, perfectly fitting the above use case.
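A quick NumPy check of this property: every draw from a Dirichlet is a non-negative vector that sums to one, i.e. a valid multinomial distribution (the α values here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])        # Dirichlet parameters α₁, α₂, α₃

samples = rng.dirichlet(alpha, size=4)   # four draws of θ = [θ₁, θ₂, θ₃]
print(samples)
print(samples.sum(axis=1))               # each row sums to 1.0
print((samples >= 0).all())              # every component is non-negative
```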

Figure 4 from [1]. The topic simplex for three topics embedded in the word simplex for three words. Each point inside a triangle is a topic/word distribution.

Recall the definition of a simplex (a generalized triangle) from [4]: a set of points {u₁, …, uₖ} determines the simplex

C = { θ₁u₁ + θ₂u₂ + ⋯ + θₖuₖ : θᵢ ≥ 0, ∑ᵢ θᵢ = 1 },

where {u₁, …, uₖ} are the vertices and u₂−u₁, …, uₖ−u₁ ∈ ℝᵏ⁻¹ are linearly independent. We can tell that the Dirichlet distribution is defined over exactly such a simplex, which justifies the figure above.

Why is Exact Inference intractable?

Inference in LDA means calculating the posterior probability below:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

where the denominator, the marginal likelihood of the document, is

p(w | α, β) = ∫ p(θ | α) ( ∏ₙ ∑ₖ p(zₙ = k | θ) p(wₙ | zₙ = k, β) ) dθ

The difficulty comes from terms of the form ∏ₙ ∑ₖ G(n, k) (where G stands for a generic function): expanding the product produces kⁿ terms, and n and k can easily reach the hundreds or thousands in practice. Moreover, the integral over θ prevents us from using the log to turn the product into a sum, making the computation intractable.
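To get a feel for the blow-up, this tiny calculation counts the terms produced by expanding ∏ₙ ∑ₖ for a hypothetical 100-word document and 50 topics.

```python
# Expanding a product over n words of sums over k topics yields k**n terms.
k, n = 50, 100               # 50 topics, a 100-word document (illustrative sizes)
terms = k ** n
print(terms)                 # ≈ 7.9e169
print(len(str(terms)))       # 170 digits: hopeless to enumerate directly
```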

How does Variational Inference make it tractable?

Variational inference approximates the calculation of p(w|α, β) with three key steps:

  1. Formulate a lower bound for log p(w|α, β) using Jensen's inequality [5] and an arbitrary distribution q(θ, z)
  2. Construct q(θ, z) to be simple enough to compute with
  3. Reduce the approximation error by maximizing the lower bound over the parameters of q(θ, z) online (i.e., for each inference)

Lower Bound Formulation

Here is a detailed version of Eq. (12) from the original paper:

log p(w | α, β) = log ∫ ∑_z p(θ, z, w | α, β) dθ
                = log ∫ ∑_z ( p(θ, z, w | α, β) q(θ, z) / q(θ, z) ) dθ
                ≥ ∫ ∑_z q(θ, z) log p(θ, z, w | α, β) dθ − ∫ ∑_z q(θ, z) log q(θ, z) dθ
                = E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)]
                =: L(q, α, β)

We can see that by introducing an arbitrary distribution q(θ, z) and applying Jensen's inequality, we obtain an alternative optimization target L(q, α, β), which replaces the ∏ₙ ∑ₖ G(n, k) terms with ∑ₙ ∑ₖ log G(n, k) terms by moving the log inside the integral. The next natural step is to design q(θ, z).

Design the q(θ, z)

To simplify the computation, we design q(θ, z) as follows:

q(θ, z | γ, ϕ) = q(θ | γ) ∏ₙ q(zₙ | ϕₙ)

where q(θ | γ) is a Dirichlet distribution with parameter γ and each q(zₙ | ϕₙ) is a multinomial distribution with parameter ϕₙ. Designing q(θ | γ) and q(z | ϕ) to be Dirichlet and multinomial, the same families as p(θ | α) and p(z | θ), greatly simplifies the derivation and computation of the above lower bound.

Graphical Model of q (γ and ϕ are free parameters)

Moreover, γ and ϕ are free parameters, which means that under q(θ, z) the variables θ and z are independent and do not depend on w either, making the final form of the lower bound even more concise.

Online Adaptation of q(θ, z)

When we approximate a value with a lower bound, it is usually desirable and natural to maximize that bound. In this case, it means finding the q(θ, z) that maximizes the lower bound by tuning the free parameters γ and ϕ. The derivation goes as follows:

1. Add Lagrange multipliers to the lower bound (to keep the multinomial parameters normalized)
2. Set the partial derivative of the lower bound with respect to γ or ϕ to zero
3. Obtain the update equation for γ or ϕ

The variational inference procedure iteratively updates γ and ϕ until convergence and then uses q(θ, z | γ*, ϕ*) as the approximation of p(θ, z | w, α, β).
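Below is a minimal sketch of this coordinate-ascent loop for one document, following the update equations of the original paper, ϕₙᵢ ∝ βᵢ,wₙ · exp(Ψ(γᵢ)) and γ = α + ∑ₙ ϕₙ; it is meant to be illustrative rather than an optimized implementation.

```python
import numpy as np
from scipy.special import digamma  # Ψ, the digamma function

def variational_inference(words, alpha, beta, n_iter=100, tol=1e-6):
    """Coordinate ascent on γ and ϕ for a single document.

    words: array of word indices w_1..w_N
    alpha: Dirichlet parameter, shape (k,)
    beta:  topic-word matrix, beta[i, j] = P(word j | topic i), shape (k, V)
    """
    k, n = len(alpha), len(words)
    phi = np.full((n, k), 1.0 / k)        # q(z_n) initialized uniformly
    gamma = alpha + n / k                 # q(θ) initialized as in the paper

    for _ in range(n_iter):
        old_gamma = gamma.copy()
        # ϕ update: ϕ_ni ∝ β[i, w_n] * exp(Ψ(γ_i)), then normalize over topics
        phi = beta[:, words].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # γ update: γ = α + Σ_n ϕ_n
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - old_gamma).sum() < tol:
            break
    return gamma, phi
```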

Note that you can also use other approximate inference methods, such as Markov chain Monte Carlo (MCMC), as in [3]. The idea is to integrate out the latent variable θ and sample directly from p(z | α, β, w) with Gibbs sampling, since z is a sufficient statistic for θ (meaning θ can be estimated from samples of z). I will not go into detail here.

Parameter Estimation

The parameter estimation process for α and β is similar to that for γ and ϕ; the difference is that parameter estimation computes over the whole corpus, while variational inference only deals with one document at a time.
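Schematically, this gives a variational EM loop: an E-step that runs the per-document inference above, and an M-step that re-estimates β from corpus-wide statistics. The sketch below reuses the hypothetical variational_inference helper from the previous block and omits the Newton-Raphson update for α for brevity.

```python
import numpy as np

def variational_em(docs, alpha, beta, n_em_iter=20):
    """docs: list of word-index arrays; alpha: shape (k,); beta: shape (k, V)."""
    k, V = beta.shape
    for _ in range(n_em_iter):
        new_beta = np.zeros((k, V))
        for words in docs:
            # E-step: fit q(θ, z | γ, ϕ) for this document
            gamma, phi = variational_inference(words, alpha, beta)
            # Accumulate M-step statistics: β_ij ∝ Σ_d Σ_n ϕ_dni · [w_dn = j]
            for pos, w in enumerate(words):
                new_beta[:, w] += phi[pos]
        # M-step: renormalize each topic's word distribution
        beta = new_beta / new_beta.sum(axis=1, keepdims=True)
        # (A full implementation would also update α by Newton-Raphson here.)
    return alpha, beta
```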

Summary

This post introduced Latent Dirichlet Allocation (LDA), a classical Bayesian machine learning method. Latent Dirichlet Allocation is a truly stunning piece of work, and it was great fun to write this article.

The original paper is nearly two decades old but is still refreshing to read. It elaborates how the authors think the document-writing process should work and condenses their thoughts into a mathematical model, unlike many current papers that explain how their methods make a neural network learn more from data. This is not to devalue neural networks; I am a firm believer in them, and I am looking forward to more inspiring work to come :D

Reference

[1] Latent Dirichlet Allocation — Journal of Machine Learning Research https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

[2] LDA · PyPI https://pypi.org/project/lda/

[3] A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling http://u.cs.biu.ac.il/~89-680/darling-lda.pdf

[4] Simplex — Wikipedia https://en.wikipedia.org/wiki/Simplex

[5] Jensen’s inequality — Wikipedia https://en.wikipedia.org/wiki/Jensen%27s_inequality

[6] tf–idf — Wikipedia https://en.wikipedia.org/wiki/Tf%E2%80%93idf
