# Learn AI Game Playing Algorithm Part III — Counterfactual Regret Minimization

In Post I and II, I have introduced some basics of AI game-playing algorithms, and Monte Carlo Tree Search (MCTS) , a popular algorithm suitable for solving perfect information games.

In Part III, which is this post, I will introduce Counterfactual Regret Minimization (CFR) , a popular algorithm for solving imperfect information games in current literature, and some of its variants . Specifically, I will talk about

1. Information Set, the key definition that models partially observable game state
2. Nash Equilibrium Approximation, the framing idea for CFR, which we briefly introduced in Part I. Here we will revisit this topic and provide some extra details.
3. Counterfactual Regret Bound, an optimizable upper bound that measures the distance to Nash equilibrium
4. CFR Procedure with Pseudo-Code
5. Monte Carlo CFR , a widely-used variant of CFR, more efficient and suitable for larger games 

Because the CFR algorithm is not a widely known algorithm, we need to introduce some new definitions along the way, making this subject seem a little more complicated than previous posts. Nevertheless, I will try my best to write this post as straightforward and intuitive as possible but with enough depth.

I want to note again that the mathematical language is not rigorous enough in this series, and please excuse me for this! If any problems found, please point them out 😃

References of many of the following points can be found at , which I find very well written and helpful for the current subject.

# Information Set

Recall Part I, we talk about the critical difference between a perfect information game and an imperfect one:

In a game with imperfect information, players cannot observe the actual game state s, which means we cannot use π(a|s) to represent strategy.

The insight is that enclosure of information makes players cannot distinguish certain individual states.

As illustrated from the above figure, we can define our strategy as π(a|I) instead of π(a|s), where I is the information set. According to , an information set is a set of decision nodes such that:

When the game reaches the information set, the player with the move cannot differentiate between nodes within the information set. All nodes in an information set belong to one player.

To explain it more clearly, I draw a game tree of Kuhn Poker  below. Kuhn Poker is an extremely simplified form of poker.

How Kuhn Poker Play

Two players each receive a card but do not show it to the other. Each player may initiate a card match (`action=call, bet=1`), and the other player may either accept the match (`action=call`), concede (`fold`) or raise the bet (`action=raise, bet=2`).

- If they both call, the larger one wins the bet. The bet can be either 2 or 1, depending on whether the raise happens.
- If either one of two folds, another player wins 1.
- The game is zero-sum.

It should be very clear from the above figure that what an information set is. In the following, we will define some symbols that used throughout this article:

• I denotes an information set
• h denotes an action sequence (or game history), h∈ I
• A(h), A(I) denote the set of available actions at h or I
• P(h) and P(I) denote functions that return the player with the move, given action sequence or information set.
• Z denotes the set of terminal histories (like the gray nodes in above figure). We write h ∈ Z to denote the game history that ends in Z.
• We will use σ to denote strategy, and σᵢ denotes the strategy for player i, σ(a|I) is the the probability of choose action a at I.
• We will use π to denote probability, π(h|σ) is the probability of reaching h with strategy σ.
• We use -i to denote players other than i, thereby define πᵢ and π₋ᵢ as contributions to probability π by player i and other players, i.e. π(h|σ) = πᵢ(h|σᵢ) π₋ᵢ(h|σ₋ᵢ); σᵢ and σ₋ᵢ as strategies of player i and other players, uᵢ and u₋ᵢ as utilities of player i and other players.
• u(z), z∈ Z denotes the utility of terminal history
• u(h|σ) denotes the utility of a game history, it can be recursively defined as below, and u(root|σ) can be written as u(σ).

# Nash Equilibrium Approximation

Recall Part I, we talk about the concept of Nash equilibrium and its corresponding strategy

Nash equilibrium strategy is a conservative but good strategy, though not optimal, like in prisoner dilemma, it works well in practice. Many successful works on solving imperfect information games are based on the idea of optimizing strategy to approximate Nash equilibrium state.

An approximation of a Nash equilibrium is called ϵ-Nash equilibrium, which is formalized in an example of a two-player game below:

However, how do we find a strategy σ like the above? The entry point lies in the concepts of average strategy and average strategy regret.

## Average Strategy and Regret

Given a player i plays from time 1 to T, let σᵗᵢ denotes his strategy in t, we can derive the average strategy over time as The equation is expanded to explain how “average over time” is derived.

At the same time, we can define the average strategy regret at time T for player i as

We can see that the Rᵢᵗ measures the utility distance between current strategy σᵗᵢ and the Nash equilibrium strategy σ*ᵢ for player i. Then it comes the key theorem that opens a possibility of Nash equilibrium approximation:

Theorem-1: In a zero-sum game at time t, if both player’s average overall regret is less than ϵ, then the average strategy is a 2ϵequilibrium strategy.

This theorem tells us that if we can minimize Rᵢᵗ to near zero, then we just need to average all the strategies used over time, which will be the Nash equilibrium strategy. It is fascinating 🤪

## Proof of above Theorem

After some surveys, I cannot find official proof for the theorem, which is more like a folk theorem. Therefore, I present my proof steps. Let us assume uᵢ is convex:

# Counterfactual Regret Bound

With the theorem above, now our task is to find a method that can minimize Rᵗᵢ by updating σᵗ iteratively, which is precisely the counterfactual regret minimization (CFR) method. There are two main points of CFR:

1. Discover an upper bound for Rᵗᵢ with counterfactual regret Rᵢᵗ(I, a)
2. Propose to use an online optimization method called regret matching to minimize each Rᵢᵗ(I, a) individually by updating σᵗ(a|I).

The counterfactual regret Rᵢᵗ(I, a) is defined on the level of the information set as below

where σᵗ{I → a} is the strategy that identical to σᵗ except that it will always choose action a when encountering I, π(z|h, σ) is the probability of reaching z given current history h.

## Understanding Counterfactual Regret

We first look at the counterfactual value function vᵢ(σᵗ, I), which consists of three parts, as illustrated in the following figure:

We can see that vᵢ(σᵗ, I) is a reasonable definition to measure the utility of information set I based on the possibility of reaching I and potential terminal utility from I.

“Counterfactual” means contrary to fact. Intuitively, the counterfactual regret closely measures the potential gain if we do something contrary to the fact, such as deliberately making action a at I instead of following σᵗ.

## Upper Bound for Average Strategy Regret

Upon the definition of counterfactual regret, we discover the following theorem, which makes us one more step closer to approximate the Nash Equilibrium.

To understand this bound, we can treat Rᵗᵢ ᵢₘₘ(I, a) as a distance between the current strategy σᵗ and another strategy σᵗ{I → a*}, which can act perfectly at information set I while identical to σᵗ in other places. Therefore, the smaller the Rᵗᵢ ᵢₘₘ(I, a) is, the better the strategy σᵗ is at I. By summing over all information sets, we can optimize the overall performance of σᵗ.

The above theorem tells us that we can minimize Rᵗᵢ by just minimizing each Rᵢᵗ(I, a), which significantly simplifies the problem.

## Regret Matching

After the simplification, we still need a method to minimize each Rᵢᵗ(I, a) by updating σᵗ(a|I). It turns out the current problem is in line with the definition of an online convex optimization algorithm called regret matching , which provides us the following iterative formula to update the strategy σᵗ(a|I) for each timestep.

At this point, we have a complete picture of CFR. In CFR, we use regret matching to update our strategy σᵗ iteratively and return the average strategy as the approximated Nash equilibrium strategy after enough iterations. It is that simple 😃

The full proof of theorem 2 can be found in the appendix of . This section does not intend to be a full introduction of the proof and only takes a little note on one of the proving steps.

Meanwhile, sorry for skipping the discussion of the proof of regret matching; that topic might need a separate post.

The crucial step in the proof is Eq (10) in A.1 , which seems not to be evident at first glance, while other steps are relatively easier to follow. I expand the key segment of Eq (10) in the following.

# Algorithm Procedure

After we have a basic understanding of the mathematical principles of CFR, we now introduce some implementation details. Recall the definition of counterfactual value function:

Therefore, to cover links to all possible endings z from h, we need a full depth-first traversal of the game tree to compute counterfactual value and regret. We usually implement the traversal using recursion.

The pseudo-code of CFR is below.

This implementation is simple, annotated, and corresponds to the symbols and procedures we explained in the previous section. Therefore, I will not go into details. But there are two things to mention:

1. Every call of `cfr` returns the counterfactual value divided by π₋ᵢ(h|σ) to simplify its recursive calculation.
2. We omit the denominator of the average strategy, which is a constant that will not affect the resulting probability.

# Monte Carlo CFR

I will introduce a popular variant of CFR, Monte Carlo CFR (MCCFR) . But before we dive into that topic, let us first calculate the time and space complexity of CFR.

Clearly, the space complexity is O(|I| * |A|), where |I| stands for the total number of information sets because we need to store Rᵗᵢ(I, a) and σᵗ(a|I). The time complexity is O(|h|* N), where |h| stands for the number of histories (or nodes) in the game tree and N denotes the number of iterations. For a large game like Texas Hold’em, |h| can go up to 10¹⁸ , which CFR is not efficient enough to solve. This is where MCCFR comes in to solve.

## General Idea

The idea of MCCFR is very straightforward. Instead of traversing the entire game tree and reaching every possible game endings z, we sample a small number of z according to a prior distribution q(z) and only traverse links that reach them.

Specifically, we split Z into subsets of {Q₁, …, Qᵣ}; we call one of these subsets a block, where each block Qᵢ is a smaller set of terminal histories.

Let qⱼ be the probability of choosing Qⱼ, then at each iteration, we first sample a block Qⱼ and only traverse parts of the game tree that reach game ending z ∈ Qⱼ.

Meanwhile, we calculate a sampled counterfactual value as below:

We can easily prove that the expectation of sampled counterfactual value is equal to the original one, which means MCCFR can work just as well as CFR, but simultaneously has great flexibility in designing the sampling strategy, making it much more efficient.

When there is only one Q₁, Q₁ = Z, then vˢᵢ(σ, I |j) = vᵢ(σ, I), MCCFR is equal to CFR.

Theoretically, we only need to change the calculation of vᵢ(σ, I) and sampling of z in CFR to become MCCFR. In practice, it is difficult to sample z without actually reaching all of them by recursion. Therefore, we must cleverly design q(z) in order to sample z but without an extra step of reaching them. I will briefly introduce two sampling methods that meet the requirement.

## Outcome Sampling

In outcome sampling, we make the following designs:

1. Each block Qⱼ only contains a single terminal history z
2. Set q(z) = π(z|σᵗ)

It seems simple, but they indicate we only need to sample one action at each step, instead of all actions like in CFR pseudo-code

Intuitively, it is like we merge the sampling of z with the process of recursive computation. At the deepest level of recursion, we return π(z|σᵗ) as q(z), which will be used by the upper level to calculate vˢᵢ(σ, I |j).

The time complexity of Outcome Sampling MCCFR is O(d * N), where d is the depth for the game tree, much smaller than |h|.

By outcome sampling, the more likely a terminal history z happens under strategy σ, the more we will sample it, which is reasonable. In practice, to balance between exploitation and exploration, we can also use ϵ-on-policy, which sample actions from a linear combination of σᵗ(a|I) and uniform distribution.

## External Sampling

In external sampling, we set q(z)=π₋ᵢ(z|σ), which looks similiar to outcome sampling, but we exclude the influence of current player strategy σᵢ over q(z).

To archive that, we will need to traverse all available actions at {h | P(h) = i}, for game histories where other players with the move, we sample an action according to σ₋ᵢ, just like the outcome sampling.

## Further Notes for MCCFR

There are many more details we could discuss further about MCCFR, such as

• The method of calculating the average strategy. Because we have biased the σᵗ with sampling strategy on z, and Rᵗᵢ(I, a) is unable to update for all information sets at every timestep. Therefore, we need a new way to compute the average strategy.
• Convergence speed for different sampling methods.

However, I will not discuss them here for reasons of space; this post has already gotten way longer than I expected 😂. If you are interested, I highly recommend you to read the phd thesis from Marc Lanctot , which is very well written and covers this topic comprehensively.

# Summary

This post introduces Counterfactual Regret Minimization (CFR), a popular method for solving imperfect information games. I hope it is helpful to you.

CFR is a truly fascinating work, which starts from profound principles and arrives at a simple but powerful algorithm framework. The background knowledge and mindset are quite different from other classical machine learning methods. It is an exciting challenge and big fun to write this article.

At the same time, I have finished the plan to write three blogs about the AI game-playing algorithm. I will probably write more blogs on this topic in the future. See you then :D

# Reference

 A Survey of Monte Carlo Tree Search Methods http://ccg.doc.gold.ac.uk/ccg_old/papers/browne_tciaig12_1.pdf

 Monte Carlo Sampling for Regret Minimization (MCCFR) in Extensive Games http://mlanctot.info/files/papers/nips09mccfr.pdf

 Information set (game theory) https://en.wikipedia.org/wiki/Information_set_(game_theory)

: Superhuman AI for multi-player poker https://science.sciencemag.org/content/365/6456/885

: Kuhn poker https://en.wikipedia.org/wiki/Kuhn_poker

: Monte Carlo Computation Sampling And Regret Minimization For Equilibrium And Decision-making In Large Extensive Form Games, Marc Lanctot, Phd Thesis http://mlanctot.info/files/papers/PhD_Thesis_MarcLanctot.pdf