Guide
Monte Carlo Tree Search explained
Minute twelve of a ranked chess match. Your queen sits on d3, both rooks are connected, and the opponent just pushed a knight to f6 — a move that looks quiet but opens three tactical forks on the next tempo. Brute-force minimax to depth eight evaluates 108 nodes and still misses the knight sacrifice on h7 that only pays off six plies later. A human titled player does not enumerate every line; they sample plausible continuations, notice which branches keep returning winning endings, and spend more analysis time on those ideas. Monte Carlo Tree Search (MCTS) automates that sampling loop: it builds a search tree incrementally, plays thousands of fast random (or learned) rollouts from each candidate move, and backs win/loss statistics up the tree so promising lines get explored more often. AlphaGo used MCTS plus deep neural networks to defeat world champions at Go; modern board-game engines, card agents, and turn-based strategy titles use variants of the same core. This guide covers the four MCTS phases, UCB1 exploration, rollout policies, neural MCTS, when MCTS beats classical minimax, a Harbor Chess midgame-fork worked example, an algorithm decision table, common pitfalls, and a production checklist — the planning layer that complements reinforcement learning and board-game design on our site.
What MCTS is (and what it is not)
Monte Carlo Tree Search is a best-first search algorithm for sequential decision problems — especially two-player, perfect-information games with large branching factors (Go, chess, shogi, hex, many card games). Instead of evaluating every child at fixed depth like minimax, MCTS allocates simulations dynamically: branches that look strong receive more rollouts; weak-looking lines are sampled just enough to avoid missing traps.
MCTS is not a replacement for all game AI:
- Real-time action games (fighters, shooters) rarely have time for thousands of rollouts per frame; behavior trees, utility AI, and perception-driven state machines dominate there.
- Hidden-information games (poker, bridge) need belief states or information-set search; vanilla MCTS assumes you see the full board.
- Tiny state spaces (tic-tac-toe, connect-four at low depth) are solved faster by alpha-beta pruning than by noisy rollouts.
Where MCTS shines
Use MCTS when legal moves per position exceed what depth-limited minimax can search, when evaluation functions are hard to hand-craft (Go before neural nets), or when you can run hundreds to millions of simulations per move offline or between turns. The algorithm is also a building block in RL pipelines: AlphaZero self-play generates training data by playing MCTS against itself.
The four phases: selection, expansion, simulation, backpropagation
Each MCTS iteration repeats four steps starting from the root (current game state):
- Selection — walk down the existing tree, choosing child nodes by a policy (usually UCB1) until you reach a node that still has unexplored legal moves.
- Expansion — add one new child node representing a single untried move from that position.
- Simulation (rollout) — play moves from the new node to a terminal state (win/loss/draw) using a fast default policy — often uniform random legal moves.
- Backpropagation — update visit counts and win statistics on every ancestor up to the root, incrementing wins for the side to move at each level according to rollout outcome.
After N iterations (bounded by a time budget or iteration count), MCTS picks the root child with the highest visit count — not necessarily the highest win rate, because visit count aggregates confidence from many simulations.
Tree reuse between moves
In turn-based games, the opponent’s reply narrows the tree: descend to the matching child and promote it as the new root, discarding sibling branches. Reusing prior work amortizes search cost across the game. Failing to reuse (rebuilding from scratch each ply) wastes CPU on mobile or web targets.
UCB1 and the exploration-exploitation tradeoff
During selection, each child node stores visit count N and win
count W. The classic UCB1 (Upper Confidence
Bound) score balances exploiting high win rates with exploring rarely
visited moves:
UCB1 = W/N + C × sqrt(ln(N_parent) / N)
The exploitation term W/N favors moves that won past rollouts.
The exploration term grows when a child has few visits relative to its
parent. Constant C (often sqrt(2) for [0,1] rewards) tunes
how aggressively you probe uncertain lines. TPUCT variants in AlphaGo
replace the second term with a prior from a policy network — the same
mathematical idea as UCB in
multi-armed bandits:
you have a limited budget of pulls (simulations) and must decide which lever
(move) to sample next.
Progressive widening and RAVE
In games with hundreds of legal moves (Go), expanding every child at once is wasteful. Progressive widening limits how many children exist as visit count grows. RAVE (Rapid Action Value Estimation) shares statistics across moves that appear in the same rollout, giving early signal before a specific node accumulates direct visits — useful when simulations are expensive.
Rollout policies: random, heuristic, and neural
Rollout quality dominates MCTS strength. Pure random play to the end is easy to implement but noisy: a brilliant sacrifice looks losing if random replies throw away the advantage. Common upgrades:
- Lightweight heuristics — capture highest-value piece, avoid immediate blunders, prefer center control in chess. Cheap and 10–50× stronger than uniform random.
- Playout policies trained offline — small networks or logistic models that mimic human amateur play.
- Truncated rollouts + value net — stop after k moves and evaluate the leaf with a learned value function instead of playing to terminal — the AlphaGo/AlphaZero pattern.
Rollouts must be fast. A chess engine budget of 800 ms per move might run 40,000 iterations; if each rollout averages 50 plies of move generation, microseconds per step matter. Bitboards, incremental Zobrist hashing, and precomputed attack tables are standard engineering, not optional optimizations.
Neural MCTS: AlphaGo and AlphaZero
AlphaGo (2016) combined MCTS with two neural networks: a
policy network p(a|s) that predicts expert
moves (narrowing which children to expand) and a value network
v(s) that estimates win probability from a position (replacing
random rollout endings). Selection uses PUCT; leaf evaluation calls the value
net instead of simulating to the end.
AlphaZero dropped human game databases: self-play games generated by MCTS produced training positions; the network improved; stronger networks produced better MCTS; repeat. The same loop mastered Go, chess, and shogi from rules alone. For indie developers, full AlphaZero training is usually impractical, but inference-time neural MCTS with a small net trained on your own puzzle positions can beat hand-tuned heuristics in custom board games where no Stockfish exists.
Harbor Chess: midgame knight fork worked example
Harbor Chess is our internal teaching engine. Consider this simplified position after 1.e4 e5 2.Nf3 Nc6 3.Bc4 Nf6 4.Ng5 d5 5.exd5 Na5 — Black ’s knight on f6 eyes e4 and h7; White’s bishop on c4 pins f7 indirectly. MCTS with 2,000 random rollouts in 300 ms might show:
d5recapture lines — 42% win rate, 380 visitsNxf7sacrifice — 58% win rate, 920 visits (UCB kept exploring after early wins)Bxf7+— 51% win rate, 700 visits
Random rollouts from Nxf7 often regain material or mate on f7
when Black mis-defends; the tree concentrates visits on the sacrifice even
though one-ply minimax with a material-only eval might reject it. After the
opponent plays Qe7, tree reuse promotes the Nxf7
subtree; the next search budget focuses on whether Bxf7+ or
Ne5 refutes the queen. This mirrors how production engines
blend MCTS-style statistics with tactical quiescence search — Harbor
logs visit counts in debug UI so designers see why the AI chose a non-obvious
move.
Algorithm decision table
| Approach | Best for | Limitation | Typical budget |
|---|---|---|---|
| Minimax + alpha-beta | Chess-like games, strong eval functions | Fixed depth misses long tactics | Depth 8–14 plies |
| Plain MCTS + random rollouts | Prototyping, custom rulesets, Go-like games | Noisy without good rollouts | 1k–100k sims/move |
| MCTS + policy/value net | Competitive agents without hand eval | Training pipeline cost | 800 sims with GPU inference |
| Heuristic greedy / utility AI | RTS, RPG companions, mobile opponents | No lookahead guarantees | Sub-millisecond |
| Tablebase / solved endgames | ≤7 piece chess, connect-four | Does not cover opening/midgame | Lookup O(1) |
Common pitfalls
- Choosing max win rate instead of max visits at the root — high win rate on 3 visits is noise; visit count is the robust pick.
- Symmetric games without perspective flip — backprop must credit the player who stood to move at each node.
- Illegal or slow move generation in rollouts — dominates wall time; profile before tuning UCB constants.
- No tree reuse — recomputing from root every ply multiplies latency on web clients.
- Applying vanilla MCTS to simultaneous or hidden moves — needs determinization or IS-MCTS extensions.
- Overfitting self-play nets to one opening — diversify training positions or the agent collapses to gimmick lines.
- Ignoring draws — threefold repetition and stalemate must score as 0.5 wins in backprop or endgame behavior drifts.
Practitioner checklist
- Implement the four phases with unit tests on tic-tac-toe (known outcomes).
- Log root children: visits, wins, UCB score, chosen move each turn.
- Profile rollout throughput (simulations per second) before neural nets.
- Add tree reuse on opponent moves; verify memory stays bounded (prune old roots).
- Swap random rollouts for capture-aware or policy-guided playouts; measure Elo gain.
- Tune exploration constant C on a fixed puzzle suite; do not guess.
- Handle draws, resign thresholds, and time controls explicitly in terminal checks.
- For shipped titles, cap thinking time and show “AI is thinking” with cancel on mobile.
- Regression-test tactical puzzles where MCTS must find known mates in 3.
- Document when the AI is MCTS vs scripted so designers set difficulty expectations.
Key takeaways
- MCTS allocates search effort by statistics from simulated playouts, not fixed-depth tree search.
- UCB1 formalizes exploration vs exploitation; neural priors extend the same idea.
- Rollout quality matters more than iteration count once basics are correct.
- AlphaZero-style self-play links MCTS to reinforcement learning for superhuman custom games.
- Tree reuse and fast move gen separate prototype MCTS from shippable turn-based AI.
Related reading
- Reinforcement learning explained — policies, value functions, and self-play training loops
- Multi-armed bandits explained — UCB, Thompson sampling, and exploration budgets
- Board game design explained — action spaces, turn structure, and AI-friendly rules
- Genetic algorithms explained — evolving heuristics when analytic search is too costly