Learn AI Game Playing Algorithm Part II — Monte Carlo Tree Search

Xinyu Zhang
9 min readDec 21, 2020


In the last post, I introduced some background knowledge for game AI, namely game representation, game categorization, and three important game-solving concepts.

In this post, Part II, I will introduce Monte Carlo Tree Search (MCTS), a popular algorithm for solving perfect-information games. Specifically, I will explain

  1. What is MCTS
  2. MCTS Procedure with pseudo-code
  3. Why MCTS can converge to the optimal strategy
  4. How MCTS is applied to Alpha Go

Strictly speaking, the algorithm I present here is UCT [1], an improved version of MCTS and probably its most popular variant. You can find the original version in Algorithm 1 of [2].

If the terminology here feels unfamiliar, I recommend reading Part I Game Basics first.

What is MCTS

In Part I, we talk about the idea of simulation.

If we use our computer strategy to play against itself and build a game tree, we can extract a lot of knowledge from that tree and thus discover much stronger strategies.

This idea is the essence of MCTS, which is about how to build a game tree upon the current game state sₒ and calculate a utility function u(s) or u(s, a) for future states. We can construct a new strategy π⁺(a|sₒ) using u(s), for example:

π⁺(a|sₒ) = 1 if a = argmaxₐ u(s), else 0

where s denotes the next game state reached by applying action a to sₒ.

Since looking ahead during simulation brings in new knowledge, π⁺(a|sₒ) usually performs better than the existing π(a|sₒ) given enough time and computational budget. Therefore, we can understand MCTS as a strategy-improving operator:

π⁺(a|sₒ) = MCTS(π(a|sₒ))

Namely, it takes in an existing strategy π(a|sₒ) and returns a better one, π⁺(a|sₒ).

How MCTS works

In the following, I will explain the algorithm flow of MCTS with Python-style pseudo-code.

Image annotated from [12], corresponding to the steps explained below

As shown below, we initialize the game tree with the current state as the root node, then make successive calls to the build function, which inserts nodes into the game tree and updates utilities incrementally.

Every node corresponds to a game state.
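The loop just described can be sketched in Python as follows (the Node fields and the build callback signature are my own assumptions for illustration, not the author's exact pseudo-code):

```python
class Node:
    """A node in the game tree; every node corresponds to a game state."""
    def __init__(self, state, parent=None):
        self.state = state            # the game state this node represents
        self.parent = parent
        self.children = []            # child nodes inserted so far
        self.visit = 0                # how many times this node was visited
        self.acc_reward = 0.0         # accumulated rollout reward

def mcts(s0, build, budget):
    """Initialize the tree with s0 as root, then call build repeatedly."""
    root = Node(s0)
    for _ in range(budget):
        build(root)                   # insert one node and update utilities
    # pi+ picks the child of the root with the best empirical statistics
    # (assumes at least one child was inserted)
    return max(root.children, key=lambda c: c.visit)
```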

The more budget we have, the more times we can call build, and the closer π⁺(a|sₒ) approximates the optimal strategy. The build function consists of three steps, as shown below:

1. Node Insertion: Find a region of interest in the tree, insert a leaf node, and return.
2. Monte Carlo Rollout (Rollout): Simulate the game from the state of node_i to the end, and return the terminal state reward.
3. Backprop: Propagate the reward information backward along the path from node_i to the root, updating the utilities for all nodes on the path.

For the sake of logical clarity, I will explain steps 2 and 3 first, followed by step 1.

Step 2: Rollout

Assuming the leaf node has been inserted, rollout simulates from the leaf state to the end of the game with a rollout policy. In other words, we sample a possible ending from a given leaf state according to the rollout policy. The pseudo-code of rollout follows:
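A minimal sketch in Python (the state methods is_terminal, legal_actions, apply, and reward are hypothetical names I use for illustration):

```python
import random

def rollout(state, rollout_policy=random.choice):
    """Simulate the game from `state` to a terminal state under the
    rollout policy and return the terminal state's reward."""
    while not state.is_terminal():
        action = rollout_policy(state.legal_actions())
        state = state.apply(action)   # states are treated as immutable here
    return state.reward()
```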

Intuitively, if rollouts from a leaf state sᵢ are mostly positive, this indicates that sᵢ and its predecessors represent winning situations for the current player. Therefore, the rollout result, i.e., the reward of the sampled terminal state, can be used to construct the utility function for all states on the path between sᵢ and the root.

We can use random as a default choice for rollout policy, namely rollout_policy = random.choice

The aim here is to estimate the utility function u(s); rollout is just a means to that end. We can also use other methods: for example, Alpha Go [6] uses a value function V(s) [7] to estimate utility in addition to rollout.

Step 3: Backprop

We have been talking about the utility function u(s), but what exactly is a utility function? It really depends on how we define it, as long as the definition ensures that the function measures the value of a state s, or a state-action pair (s, a), to the player.

Therefore, we can certainly define u(s) identically to the value function in reinforcement learning (RL) [3], which is a fully compliant definition. Corresponding to the rollout step, here we define the utility function as:

u(s) = Σₛₑ πᵣₒ(sₑ|s) · R(sₑ)

where sₑ represents a possible terminal state, R(sₑ) its reward, and πᵣₒ(sₑ|s) the probability of reaching sₑ from s under the rollout policy.

In practice, we will not need to define the utility function explicitly in MCTS. We only need to define the “delta” of utility function, such as the rollout result, and accumulate deltas using backprop.

“backprop” here is an amusing namesake of backpropagation in neural networks [4].

The procedure of backprop is quite simple, as listed below:

For each state along the path from the leaf to the root, we add R(sₑ) to its accumulated reward and increment its visit count. Then n.acc_reward / n.visit is an unbiased estimator of u(s).
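A minimal sketch of this update (the Node fields are my assumptions, mirroring the quantities in the text):

```python
class Node:
    """Minimal tree node: parent link, visit count, accumulated reward."""
    def __init__(self, parent=None):
        self.parent = parent
        self.visit = 0
        self.acc_reward = 0.0

def backprop(node, reward):
    """Add the terminal reward R(s_e) to every node on the path from the
    just-inserted leaf up to the root, and bump their visit counts."""
    while node is not None:
        node.visit += 1
        node.acc_reward += reward     # node.acc_reward / node.visit ~ u(s)
        node = node.parent
```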

Step 1: Insert Node with Tree Policy

Combining Rollout and Backprop, we can estimate the utility function for all states from a just-inserted leaf node up to the root. But how do we determine where to insert the leaf node? This is where the tree policy comes in.

Assume each node insertion costs a fixed amount of time; then our time budget is equivalent to a budget on the number of nodes we can insert. Moreover, the deeper we reach into a game-tree branch, the more accurate our utility estimates become for all the states along that branch, as demonstrated in the figure below:

The utility estimate of the leftmost node (in dark red) is more accurate than those of the other children of the root.

Intuitively, we hope the tree policy is able to do both of the following:

  1. Provide estimates that are as accurate as possible for the likely winning states
  2. Find the optimal child state and action of the root state, avoiding getting trapped in a local optimum

The node insertion procedure is as follows:

We always choose the best child until we meet a node that is not fully expanded; then we expand that node by inserting another child according to a given prior probability πₚ(a|s).

The criterion for full expansion can also be customized.
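A sketch of the insertion step (the state interface and the best_child callback are my assumptions; best_child itself is discussed below):

```python
import random

class Node:
    """Minimal tree node; `children` maps actions to child nodes."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}
        self.visit, self.acc_reward = 0, 0.0

def insert_node(node, best_child, prior=random.choice):
    """Descend via best_child until we meet a node that is not fully
    expanded, then insert one new child drawn from the prior pi_p(a|s)."""
    while not node.state.is_terminal():
        untried = [a for a in node.state.legal_actions()
                   if a not in node.children]
        if untried:                           # not fully expanded: expand
            a = prior(untried)
            child = Node(node.state.apply(a), parent=node)
            node.children[a] = child
            return child
        node = best_child(node)               # fully expanded: go deeper
    return node                               # terminal node: nothing to add
```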

We can see that the above strategy, which always looks for a good state to expand, easily satisfies the first requirement. However, it is also easy to see that following it blindly can let a suboptimal subtree dominate if it initially looks better, thus failing the second requirement.

For games with opponents, we can simulate the opponent's policy by choosing the worst child instead of the best one, as in minimax search.

The solution lies in how we define the best_child function.

Why MCTS finds the Optimal Strategy

We implement the best_child as follows:

Instead of choosing the best child solely based on the utility function, we add an exploratory term, so the selection rule is

best_child(s) = argmax over s′ of [ u(s′) + c · sqrt(2 · ln N(s) / N(s′)) ]

where N(·) returns the visit count of a state and c is a constant. It is easy to see that the exploratory term is large if s′ has been visited only a few times, which means the estimate of u(s′) carries a large degree of uncertainty.
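This selection rule can be sketched as follows (duck-typed: any node with visit, acc_reward, and a children mapping works; the field names are my assumptions):

```python
from math import log, sqrt

def best_child(node, c=1.0):
    """Pick the child maximizing empirical utility plus the UCB1
    exploration bonus c * sqrt(2 * ln N(s) / N(s'))."""
    def score(child):
        exploit = child.acc_reward / child.visit      # empirical u(s')
        explore = c * sqrt(2 * log(node.visit) / child.visit)
        return exploit + explore
    return max(node.children.values(), key=score)
```

With c = 0 this degenerates to pure greedy selection; larger c favors rarely visited children.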

The second term helps us avoid local optima caused by statistical randomness. It can be proved that, by following the above child-selection method, MCTS is guaranteed to find the optimal strategy given a sufficient time budget. However, the full proof is quite involved; here I will only sketch its general idea.

The statements below are far from rigorous; I apologize in advance. The real proof can be found in [1].

General Idea of the Proof

The best_child function actually implements the UCB1 policy [5] for the multi-armed bandit problem [8]. Let n be the number of trials; UCB1 guarantees that the number of times we choose a non-optimal bandit is only O(log n) for large enough n, which means the fraction of suboptimal choices vanishes:

T_subopt(n) / n → 0 as n → ∞

where T_subopt(n) counts the suboptimal choices among the first n trials.

It is natural to think of best_child from the root node as a multi-armed bandit problem.

However, the multi-armed bandit problem assumes the reward distribution of each bandit is stationary. In contrast, the reward distribution of a child node certainly changes as its subtree expands over time.

A key observation is that while the reward distributions in MCTS vary over time, they also converge to fixed distributions, because the game tree will eventually expand to completion. Therefore, the non-stationary bandit problem stabilizes into a standard bandit problem.

With that in mind, we can look at the game tree from the bottom up

Deeper nodes stabilize faster than shallower ones

Eventually, we will be solving a standard multi-armed bandit problem at the root node, and therefore with UCB1 we can discover the optimal π(a|s).

Application to Alpha Go

Alpha Go [6] is an extremely successful application of MCTS, a perfect combination of RL and MCTS. Here I briefly describe the MCTS-related techniques used in Alpha Go and Alpha Zero [9], respectively.

Alpha Go

Compared to the MCTS presented in this post, Alpha Go makes several customizations to MCTS:

1. Defines the utility function over state-action pairs as u(s, a), which means the core data structure is the Edge rather than the Node. Utility estimation becomes more granular, since the domain of (s, a) is larger than that of s alone.

2. Uses a linear combination of a value network v_θ(s) and the rollout result z as the delta for the utility function:

(1 − λ) · v_θ(s) + λ · z

3. Uses an RL-trained policy network as the rollout policy and as the tree-expansion prior probability. Interestingly, a small policy network performs better for MCTS than a larger, more accurate one, which illustrates the importance of exploration.

4. Uses PUCT [11] instead of UCT, whose exploratory term is

c · πₚ(a|s) · sqrt(N(s)) / (1 + N(s, a))

which prefers to select a child with a higher prior probability at the beginning of the tree search.

Alpha Zero

Different from its predecessor, beyond using MCTS for prediction, Alpha Zero designs a self-play training method that uses MCTS, as a strategy-improving operator, to obtain the supervisory signal. The figure below explains the training cycle step by step, with numbers in circles:

Image annotated from [9]

By performing the above five steps in a cycle, the neural network continues to improve under the traction of MCTS and eventually converges. How marvelous!


In this post, I introduced Monte Carlo Tree Search, including its algorithmic procedure and applications. I hope you found it useful. In the next post, I will explain Counterfactual Regret Minimization [10] and how it solves imperfect-information games like poker.


[1]: Improved Monte-Carlo Search http://old.sztaki.hu/~szcsaba/papers/cg06-ext.pdf

[2]: A Survey of Monte Carlo Tree Search Methods http://ccg.doc.gold.ac.uk/ccg_old/papers/browne_tciaig12_1.pdf

[3]: Value Iteration https://cs.uwaterloo.ca/~ppoupart/teaching/cs886-spring13/slides/cs886-module6-value-iteration.pdf

[4]: Backpropagation https://en.wikipedia.org/wiki/Backpropagation

[5]: Finite-time Analysis of the Multiarmed Bandit Problem https://homes.di.unimi.it/cesa-bianchi/Pubblicazioni/ml-02.pdf

[6]: Mastering the game of Go with deep neural networks and tree search https://www.nature.com/articles/nature16961

[7]: Value Function definition in RL http://incompleteideas.net/book/first/ebook/node34.html

[8]: Asymptotically Efficient Adaptive Allocation Rules https://citeseerx.ist.psu.edu/viewdoc/download?doi= (The first paragraph contains a brief introduction to the multi-armed bandit problem)

[9]: Mastering the game of Go without human knowledge https://www.nature.com/articles/nature24270

[10]: Regret Minimization in Games with Incomplete Information https://papers.nips.cc/paper/2007/file/08d98638c6fcd194a4b1e6992063e944-Paper.pdf

[11]: Multi-armed bandits with episode context http://gauss.ececs.uc.edu/Workshops/isaim2010/papers/rosin.pdf

[12]: Monte Carlo tree search https://en.wikipedia.org/wiki/Monte_Carlo_tree_search