LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Pith reviewed 2026-05-10 19:08 UTC · model grok-4.3
The pith
LLMs agree with game-theory optimal play in Ludo only 40-46% of the time and split into two incomplete behavioral archetypes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability.
What carries the argument
480 handcrafted spot scenarios grouped into 12 behaviorally distinct decision categories, each isolating one strategic choice in Ludo while holding other variables fixed.
If this is right
- LLM strategic reasoning in stochastic multi-agent settings covers only part of the optimal policy and leaves systematic gaps.
- Prompts that include past opponent actions can shift decisions even when the current board state is unchanged.
- The two observed archetypes suggest LLMs can be prompted or fine-tuned toward more balanced play by targeting the missing half of the strategy.
- LudoBench offers a lightweight testbed that can track whether future models close the 40-46% agreement gap.
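The prompt-sensitivity point above can be made concrete with a small sketch. It assumes each board state was answered twice, once with a neutral prompt and once with the grudge-framed prompt; the record type `PairedResult`, its field names, and the demo spot ids are hypothetical and do not come from the paper's released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PairedResult:
    """One board state answered under two framings (hypothetical record)."""
    spot_id: str
    neutral_move: int  # token index chosen with the plain prompt
    grudge_move: int   # token index chosen with the grudge-framed prompt


def shift_rate(results: List[PairedResult]) -> float:
    """Fraction of board states whose chosen move changed between framings."""
    if not results:
        raise ValueError("no paired results")
    changed = sum(r.neutral_move != r.grudge_move for r in results)
    return changed / len(results)


# Made-up demo data: two of four identical boards get a different move.
demo = [
    PairedResult("capture_01", 0, 0),
    PairedResult("capture_02", 1, 2),
    PairedResult("safety_01", 3, 3),
    PairedResult("safety_02", 0, 1),
]
print(shift_rate(demo))  # 0.5
```

Because the board state is identical within each pair, any non-zero rate is attributable to the framing (or to sampling noise if the temperature is above zero).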
Where Pith is reading between the lines
- Similar spot-based testing could expose comparable blind spots in other stochastic games or real-world planning tasks that involve uncertainty and opponents.
- The prompt sensitivity finding implies that history-aware prompting may be needed for reliable multi-turn decision making.
- Full-game win rates against the game-theory agent would reveal whether the spot-level archetypes compound into overall performance differences.
Load-bearing premise
The 480 spot scenarios isolate the intended strategic choices without interference from random dice outcomes or ongoing multi-player interactions.
What would settle it
Run the same six models on complete Ludo matches against the Expectiminimax agent and count how often their moves match the baseline across repeated trials of the same starting positions.
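A minimal sketch of the counting step described above, assuming each logged decision is a `(position_id, model_move, baseline_move)` tuple. That record layout and the function name are illustrative assumptions, not the format of the released model outputs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_position(
    records: Iterable[Tuple[str, int, int]]
) -> Tuple[Dict[str, float], float]:
    """Return per-position and overall agreement with the baseline move."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for position_id, model_move, baseline_move in records:
        totals[position_id] += 1
        hits[position_id] += int(model_move == baseline_move)
    if not totals:
        raise ValueError("no records")
    per_position = {p: hits[p] / totals[p] for p in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_position, overall


# Made-up demo: position p1 is matched in 1 of 2 trials, p2 in 2 of 2.
demo_records = [("p1", 0, 0), ("p1", 1, 0), ("p2", 2, 2), ("p2", 2, 2)]
per_position, overall = agreement_by_position(demo_records)
print(per_position, overall)  # {'p1': 0.5, 'p2': 1.0} 0.75
```

Keeping the per-position rates alongside the overall figure shows whether disagreement is spread evenly or concentrated in a few starting positions.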
Original abstract
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo consisting of 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, together with a 4-player simulator supporting Random, Heuristic, Expectiminimax (game-theory), and LLM agents. Evaluation of six models from four families shows 40-46% agreement with the Expectiminimax baseline; models divide into 'finisher' and 'builder' archetypes that each capture only half the baseline strategy; and models exhibit measurable shifts under history-conditioned grudge framing on identical board states. Code, the spot dataset, and model outputs are released.
Significance. If the handcrafted scenarios validly isolate the targeted strategic choices without confounding from dice expectations or opponent modeling, the work would demonstrate concrete limitations in current LLMs for stochastic multi-agent planning, including incomplete strategy capture and prompt sensitivity. The released simulator and dataset constitute a reproducible, lightweight framework that could support future controlled studies of LLM decision-making under uncertainty.
major comments (2)
- [§3] §3 (LudoBench construction) and abstract: the central claim that each of the 480 scenarios 'isolates a specific strategic choice' is load-bearing for the reported 40-46% agreement rates, archetype split, and grudge-framing results, yet the manuscript provides no explicit validation that move values have been computed as expectations over the dice distribution or under consistent opponent-policy assumptions. In Ludo, even fixed board states require such expectations; without this, the measured divergences could be artifacts of implicit single-agent or deterministic assumptions embedded in the handcrafting.
- [Results] Results section: the paper reports archetype classifications and behavioral shifts but contains no statistical tests for the significance of the 40-46% agreement figures, no per-category error breakdowns, and no description of the exact prompting templates or temperature settings used for the LLM agents. These omissions prevent assessment of whether the archetype and sensitivity claims are robustly supported by the data.
minor comments (2)
- [Simulator] The Expectiminimax implementation description would benefit from stating the exact search depth and any move-ordering or pruning used, to aid exact reproduction of the baseline.
- [§3] An illustrative example of one spot scenario from each of the 12 categories in the main text (rather than only in the released dataset) would improve interpretability of the decision categories.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.
Referee: [§3] §3 (LudoBench construction) and abstract: the central claim that each of the 480 scenarios 'isolates a specific strategic choice' is load-bearing for the reported 40-46% agreement rates, archetype split, and grudge-framing results, yet the manuscript provides no explicit validation that move values have been computed as expectations over the dice distribution or under consistent opponent-policy assumptions. In Ludo, even fixed board states require such expectations; without this, the measured divergences could be artifacts of implicit single-agent or deterministic assumptions embedded in the handcrafting.
Authors: We agree that the manuscript would benefit from greater explicitness on this point. The 480 scenarios were constructed by first running the Expectiminimax agent (which computes full expectations over the dice distribution at each node and assumes consistent opponent policies via the shared simulator) on candidate board states, then selecting states where the optimal move differs from heuristic baselines in a targeted behavioral dimension. To address the concern directly, we will expand §3 with a new subsection describing the Expectiminimax implementation, including the stochastic rollout procedure for dice expectations and the fixed opponent-policy assumptions. We will also add one worked example per decision category showing the board state, the dice distribution, the computed move values, and why the chosen move isolates the intended choice. These additions will make the isolation claim verifiable without altering the reported results. revision: yes
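To illustrate what "expectations over the dice distribution at each node" means, here is a generic depth-limited Expectiminimax over a toy tree. It is a two-player sketch under stated assumptions, not the authors' 4-player implementation: the `Node` type, its field names, and the convention that an internal node's `value` holds a static evaluation used at the depth cutoff are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # "max", "min", "chance", or "leaf"
    value: float = 0.0             # leaf payoff, or static evaluation at cutoff
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)  # chance nodes only


def expectiminimax(node: Node, depth: int) -> float:
    """Depth-limited Expectiminimax value of `node`."""
    if node.kind == "leaf" or depth == 0 or not node.children:
        return node.value
    scores = [expectiminimax(child, depth - 1) for child in node.children]
    if node.kind == "max":
        return max(scores)
    if node.kind == "min":
        return min(scores)
    if node.kind == "chance":
        # Probability-weighted average over dice outcomes.
        return sum(p * s for p, s in zip(node.probs, scores))
    raise ValueError(f"unknown node kind: {node.kind}")


def die_node(payoffs: List[float]) -> Node:
    """Chance node for a fair die: one equally likely leaf per face."""
    leaves = [Node("leaf", value=v) for v in payoffs]
    return Node("chance", children=leaves, probs=[1 / len(payoffs)] * len(payoffs))


# Move A pays the face value of the next roll; move B always pays 3.
move_a = die_node([1, 2, 3, 4, 5, 6])
move_b = die_node([3, 3, 3, 3, 3, 3])
root = Node("max", children=[move_a, move_b])
print(round(expectiminimax(root, depth=2), 3))  # 3.5
```

The point of the chance layer is that a move is scored by its average over all six faces rather than by one sampled roll, which is the property the referee asks to see validated.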
Referee: [Results] Results section: the paper reports archetype classifications and behavioral shifts but contains no statistical tests for the significance of the 40-46% agreement figures, no per-category error breakdowns, and no description of the exact prompting templates or temperature settings used for the LLM agents. These omissions prevent assessment of whether the archetype and sensitivity claims are robustly supported by the data.
Authors: We accept that these details were insufficiently reported. In the revised manuscript we will add (1) binomial or chi-squared statistical tests with p-values for the overall 40-46% agreement rates against a random baseline, (2) a supplementary table breaking down agreement and error types by the 12 decision categories to support the archetype analysis, and (3) the precise prompting templates (including the history-conditioned grudge framing) together with temperature settings (0.0 for all models to ensure reproducibility) in a new appendix. These changes require only additional text and tables; no new experiments are needed. revision: yes
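The binomial test mentioned in item (1) can be sketched with the standard library alone. The numbers in the demo are assumptions for illustration only: 206 agreements out of 480 is a made-up count inside the reported 40-46% band, and a random-baseline agreement probability of 0.25 (one of four tokens chosen uniformly) is hypothetical, since the true chance rate depends on how many legal moves each scenario offers.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs: 206 of 480 agreements, chance rate assumed to be 0.25.
p_value = binomial_tail(206, 480, 0.25)
print(f"one-sided p-value vs. assumed random baseline: {p_value:.3g}")
```

A per-scenario chance rate (1 divided by the number of legal moves in that scenario) would be a more faithful null than a single global `p`, but it requires a Poisson-binomial rather than a plain binomial tail.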
Circularity Check
No significant circularity in LudoBench evaluation framework
The paper introduces handcrafted spot scenarios and an independent Expectiminimax game-theory baseline implemented in a contributed simulator; the reported 40-46% agreement rates, behavioral archetypes, and grudge-framing shifts are direct empirical measurements of LLM outputs against this external baseline. No self-citations, fitted parameters, or self-referential definitions appear in the derivation chain, and the results do not reduce by construction to the inputs via any of the enumerated patterns. The framework is self-contained as a benchmark comparison.
Reference graph
Works this paper leans on
- [1] Player 2 captured your piece last turn despite having a safer option
  doi: 10.1109/CIG.2012.6374142. Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018. Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. In Science, volume 365, pp. 885–890, 2019. Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep Blue. Artifi...
- [2] Capture (+100): Any legal capture on a non-safe square dominates all other move types. The heuristic always captures when possible.
- [3] Home path progress (52+): Entering or advancing in the home path is valued above all non-capture main-board moves.
- [4] Leave base (50): Bringing a piece out of base is valued slightly below home entry but above most main-board advancement.
- [5] Safe square (+20): Landing on a safe square provides a moderate defensive bonus, but does not override capture or home-entry incentives.
- [6] Main-board advancement (0–51): Basic forward progress, with higher scores for positions further from start.
Preprint. Under review.
G LLM Prompt Template
The following is the complete prompt template provided to the LLM agent, reproduced from llm agent.py. Fields in [CAPS] are populated dynamically from the spot configuration. In spot evaluation mo...
- [7] Main circular board: 52 squares (0-51)
- [8] Each player has a fixed START square
- [9] Tokens move forward relative to START
- [10] After one full lap (52 steps), tokens enter that player's HOME PATH
- [11] Each player has a UNIQUE HOME PATH (>= 52)
- [12] Final home position is HOME_END
- [13] Tokens must land EXACTLY on HOME_END
- [14] Overshooting HOME_END is illegal.
-------------------- TOKEN STATES --------------------
- Position = -1 : Token is in base
- Position 0-51 : Token is on main board
- Position >= 52 : Token is in home path
- HOME_END : Token has finished
-------------------- GAME RULES --------------------
- [15] All tokens start in base (-1)
- [16] Leave base ONLY on dice = 6
- [17] Leaving base places token at START square
- [18] Tokens move forward by dice value
- [19] No stacking (one token per square)
- [20] CAPTURE: land on opponent on non-safe square -> opponent sent to base (-1)
- [21] Captures NEVER happen on safe squares
- [22] Safe squares protect tokens from capture
- [23] Rolling 6 grants an extra turn
- [24] No legal move -> turn skipped
- [25] First to move ALL tokens to HOME_END wins
- [26] Home paths are private to each player.
-------------------- CURRENT GAME STATE --------------------
Number of players: [NUM_PLAYERS]
Active player ids: [PLAYER_IDS]
Dice rolled: [DICE]
Your token positions (Player [PLAYER_ID]): [YOUR_TOKENS]
Other players' token positions: [OTHER_PLAYERS_TOKENS]
Your start square: [START_POS]
You...
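The token states and movement rules quoted in the prompt excerpts above can be sketched as a small legality check. This is an illustrative reading, not the simulator's code. It assumes positions are stored as steps travelled from the player's own START (so 0-51 is the main board and 52 or more is the home path), that `home_end` is passed in by the caller (the demo value 57 is hypothetical, since the excerpts do not give it), and that the safe-square set is supplied externally.

```python
from typing import List, Optional, Set

BASE = -1   # token is in base
LAP = 52    # squares on the main circular board


def token_state(steps: int, home_end: int) -> str:
    """Classify a token by the TOKEN STATES table."""
    if steps == BASE:
        return "base"
    if steps == home_end:
        return "finished"
    return "home_path" if steps >= LAP else "main_board"


def destination(steps: int, dice: int, own_tokens: List[int],
                home_end: int) -> Optional[int]:
    """New step count after moving, or None if the move is illegal."""
    if steps == home_end:
        return None                      # finished tokens do not move
    if steps == BASE:
        if dice != 6:
            return None                  # leave base ONLY on dice = 6
        target = 0                       # START square, zero steps travelled
    else:
        target = steps + dice
    if target > home_end:
        return None                      # overshooting HOME_END is illegal
    if target != home_end and target in own_tokens:
        return None                      # no stacking on one's own token
    return target


def is_capture(target_steps: int, start: int,
               opponent_squares: Set[int], safe_squares: Set[int]) -> bool:
    """True if landing sends an opponent token back to base."""
    if target_steps >= LAP:
        return False                     # home paths are private
    square = (start + target_steps) % LAP  # absolute main-board square
    return square in opponent_squares and square not in safe_squares


HOME_END = 57  # hypothetical value for the demo
print(destination(BASE, 6, [BASE] * 4, HOME_END))   # 0
print(destination(55, 3, [55], HOME_END))           # None
print(is_capture(4, 13, {17}, {8}))                 # True
```

Treating HOME_END as exempt from the stacking check is itself an assumption, made so that all four tokens can finish as rule [25] requires.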