arXiv: 2604.05681 · v1 · submitted 2026-04-07 · 💻 cs.AI · cs.CL · cs.GT · cs.LG · cs.MA

LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Pith reviewed 2026-05-10 19:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.GT · cs.LG · cs.MA
keywords Ludo · LLM evaluation · strategic reasoning · board games · behavioral archetypes · prompt sensitivity · game theory baseline · stochastic planning

The pith

LLMs agree with game-theory optimal play in Ludo only 40-46% of the time and split into two incomplete behavioral archetypes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LudoBench, a set of 480 handcrafted board positions in the dice-driven game Ludo, each designed to test one narrow strategic decision such as whether to capture an opponent or advance toward home. A game-theory agent that performs limited lookahead search supplies a clear baseline for good play. Six LLMs from four families are tested on these positions and match the baseline only 40-46 percent of the time. The models fall into two consistent patterns: finishers that push pieces home but leave others undeveloped, and builders that move pieces forward but never complete them. The same models also change their choices on identical boards when the prompt includes a history of past captures, showing sensitivity to framing.

Core claim

Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability.

What carries the argument

480 handcrafted spot scenarios grouped into 12 behaviorally distinct decision categories, each isolating one strategic choice in Ludo while holding other variables fixed.

If this is right

  • LLM strategic reasoning in stochastic multi-agent settings covers only part of the optimal policy and leaves systematic gaps.
  • Prompts that include past opponent actions can shift decisions even when the current board state is unchanged.
  • The two observed archetypes suggest LLMs can be prompted or fine-tuned toward more balanced play by targeting the missing half of the strategy.
  • LudoBench offers a lightweight testbed that can track whether future models close the 40-46% agreement gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar spot-based testing could expose comparable blind spots in other stochastic games or real-world planning tasks that involve uncertainty and opponents.
  • The prompt sensitivity finding implies that history-aware prompting may be needed for reliable multi-turn decision making.
  • Full-game win rates against the game-theory agent would reveal whether the spot-level archetypes compound into overall performance differences.

Load-bearing premise

The 480 spot scenarios isolate the intended strategic choices without interference from random dice outcomes or ongoing multi-player interactions.

What would settle it

Run the same six models on complete Ludo matches against the Expectiminimax agent and count how often their moves match the baseline across repeated trials of the same starting positions.
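
The move-matching count described here reduces to a per-category agreement rate. A minimal Python sketch, assuming each evaluation record is a hypothetical `(category, llm_move, gt_move)` tuple (the released dataset's actual schema is not shown on this page) and that an invalid LLM move is stored as `None` and counted as a disagreement. The category names in the example are placeholders.

```python
from collections import defaultdict

def gt_alignment(records):
    """Return (overall_rate, per_category_rate) of LLM/GT move agreement.

    records: iterable of (category, llm_move, gt_move) tuples.
    ASSUMPTION: llm_move is None when the model produced an invalid move,
    which then never equals gt_move and so counts as a disagreement.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, llm_move, gt_move in records:
        totals[category] += 1
        if llm_move is not None and llm_move == gt_move:
            hits[category] += 1
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_category

# Toy records with placeholder category names and move indices.
example = [
    ("capture_vs_safe", 0, 0),
    ("capture_vs_safe", None, 0),  # invalid output counts as a miss
    ("home_entry", 2, 2),
    ("home_entry", 1, 1),
]
overall, per_category = gt_alignment(example)
```

Reporting the per-category dictionary alongside the overall rate is what would let an archetype such as "finisher" show up as high agreement in some categories and low agreement in others.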

Figures

Figures reproduced from arXiv: 2604.05681 by Dhruv Kumar, Ojas Jain.

Figure 2: Agent Head-to-Head Win-Rate Matrix. Win rates over 200 games confirm the baseline skill ladder: Random < Heuristic < GT. GT’s 59% vs. Heuristic validates depth-limited search advantage.
Figure 4: GT Alignment Score (move agreement with GT agent) by model and category. Dark = high agreement; light = systematic disagreement. Rule compliance does not simply depend on model size. Gemma-3-12B-IT (10% invalid) performs worse than the much larger Qwen-Plus (4%), suggesting that training methodology matters as much as scale. All metrics are formally defined in [...]
Figure 5: Archetype Clustering: Builder versus Finisher. Each model is plotted by development tendency (x-axis) versus completion tendency (y-axis). The GT agent (star) exhibits both tendencies. LLMs split into finishers (top-left) and builders (bottom-right).
Figure 8: Persona Alignment Heatmap. Each cell shows how strongly behavior shifts to match the persona (1.0 = perfect). Most scores fall between 0.3 and 0.5, indicating weak effects. Only Q7B-aggressive (0.93) and QP-greedy (0.83) show strong alignment.
Figure 9: Example spot entries. (a) A capture vs safe scenario: Player 1 (LLM) with dice=6 can either capture Player 0’s token at square 49 or move to safety. (b) A grudge pair: identical board state with neutral (a) and grudge (b) history framing.
Original abstract

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/
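
To illustrate the depth-limited Expectiminimax idea named in the abstract, here is a minimal Python sketch over an explicit toy tree. The node encoding, the payoff numbers, and the `evaluate` fallback are all assumptions made for illustration; the paper's simulator, its search depth, and its handling of four players are not reproduced here.

```python
from fractions import Fraction

# Toy node encoding (an assumption for this sketch, not the paper's data model):
#   ("leaf", value)                   terminal payoff for the searching player
#   ("max", [child, ...])             searching player picks the best child
#   ("min", [child, ...])             opponent picks the child worst for us
#   ("chance", [(prob, child), ...])  dice roll, averaged by probability

def expectiminimax(node, depth, evaluate):
    """Back up values through max, min, and chance nodes to a depth limit."""
    kind, payload = node
    if kind == "leaf":
        return payload
    if depth == 0:
        return evaluate(node)  # static estimate once the lookahead budget is spent
    if kind == "chance":
        return sum(p * expectiminimax(c, depth - 1, evaluate) for p, c in payload)
    values = [expectiminimax(c, depth - 1, evaluate) for c in payload]
    return max(values) if kind == "max" else min(values)

def best_move(root, depth, evaluate):
    """Index of the root child with the highest backed-up value."""
    assert root[0] == "max"
    scores = [expectiminimax(c, depth - 1, evaluate) for c in root[1]]
    return scores.index(max(scores))

# Hypothetical spot: capturing pays 8 but has a 1/6 chance of costing 10,
# while moving to safety is worth a certain 4.
capture = ("chance", [(Fraction(1, 6), ("leaf", -10)), (Fraction(5, 6), ("leaf", 8))])
safe = ("leaf", 4)
root = ("max", [capture, safe])

def static(node):
    """Placeholder static evaluation used when the depth limit cuts the search."""
    return 0
```

With two plies of lookahead the capture branch backs up to an expected value of 5 and is preferred; with one ply the chance node is cut off and scored by the placeholder `static` function, so the safe move wins. That dependence on depth is one reason the referee's request for the exact search depth matters for reproducing the baseline.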

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo consisting of 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, together with a 4-player simulator supporting Random, Heuristic, Expectiminimax (game-theory), and LLM agents. Evaluation of six models from four families shows 40-46% agreement with the Expectiminimax baseline; models divide into 'finisher' and 'builder' archetypes that each capture only half the baseline strategy; and models exhibit measurable shifts under history-conditioned grudge framing on identical board states. Code, the spot dataset, and model outputs are released.

Significance. If the handcrafted scenarios validly isolate the targeted strategic choices without confounding from dice expectations or opponent modeling, the work would demonstrate concrete limitations in current LLMs for stochastic multi-agent planning, including incomplete strategy capture and prompt sensitivity. The released simulator and dataset constitute a reproducible, lightweight framework that could support future controlled studies of LLM decision-making under uncertainty.

major comments (2)
  1. [§3] §3 (LudoBench construction) and abstract: the central claim that each of the 480 scenarios 'isolates a specific strategic choice' is load-bearing for the reported 40-46% agreement rates, archetype split, and grudge-framing results, yet the manuscript provides no explicit validation that move values have been computed as expectations over the dice distribution or under consistent opponent-policy assumptions. In Ludo, even fixed board states require such expectations; without this, the measured divergences could be artifacts of implicit single-agent or deterministic assumptions embedded in the handcrafting.
  2. [Results] Results section: the paper reports archetype classifications and behavioral shifts but contains no statistical tests for the significance of the 40-46% agreement figures, no per-category error breakdowns, and no description of the exact prompting templates or temperature settings used for the LLM agents. These omissions prevent assessment of whether the archetype and sensitivity claims are robustly supported by the data.
minor comments (2)
  1. [Simulator] The Expectiminimax implementation description would benefit from stating the exact search depth and any move-ordering or pruning used, to aid exact reproduction of the baseline.
  2. [§3] An illustrative example of one spot scenario from each of the 12 categories in the main text (rather than only in the released dataset) would improve interpretability of the decision categories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.

  1. Referee: [§3] §3 (LudoBench construction) and abstract: the central claim that each of the 480 scenarios 'isolates a specific strategic choice' is load-bearing for the reported 40-46% agreement rates, archetype split, and grudge-framing results, yet the manuscript provides no explicit validation that move values have been computed as expectations over the dice distribution or under consistent opponent-policy assumptions. In Ludo, even fixed board states require such expectations; without this, the measured divergences could be artifacts of implicit single-agent or deterministic assumptions embedded in the handcrafting.

    Authors: We agree that the manuscript would benefit from greater explicitness on this point. The 480 scenarios were constructed by first running the Expectiminimax agent (which computes full expectations over the dice distribution at each node and assumes consistent opponent policies via the shared simulator) on candidate board states, then selecting states where the optimal move differs from heuristic baselines in a targeted behavioral dimension. To address the concern directly, we will expand §3 with a new subsection describing the Expectiminimax implementation, including the stochastic rollout procedure for dice expectations and the fixed opponent-policy assumptions. We will also add one worked example per decision category showing the board state, the dice distribution, the computed move values, and why the chosen move isolates the intended choice. These additions will make the isolation claim verifiable without altering the reported results. revision: yes

  2. Referee: [Results] Results section: the paper reports archetype classifications and behavioral shifts but contains no statistical tests for the significance of the 40-46% agreement figures, no per-category error breakdowns, and no description of the exact prompting templates or temperature settings used for the LLM agents. These omissions prevent assessment of whether the archetype and sensitivity claims are robustly supported by the data.

    Authors: We accept that these details were insufficiently reported. In the revised manuscript we will add (1) binomial or chi-squared statistical tests with p-values for the overall 40-46% agreement rates against a random baseline, (2) a supplementary table breaking down agreement and error types by the 12 decision categories to support the archetype analysis, and (3) the precise prompting templates (including the history-conditioned grudge framing) together with temperature settings (0.0 for all models to ensure reproducibility) in a new appendix. These changes require only additional text and tables; no new experiments are needed. revision: yes
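
The binomial test mentioned in the second response can be sketched with the Python standard library alone. The snippet computes an exact one-sided tail probability. The chance-agreement rate `p0 = 0.25` is purely a hypothetical placeholder (the appropriate random baseline depends on the number of legal moves in each scenario, which is not stated here), and 192 of 480 is simply the count implied by the lower 40% figure.

```python
import math

def binom_tail_ge(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space via lgamma to avoid overflow
    in the binomial coefficient for n in the hundreds.
    """
    def log_pmf(i):
        log_coeff = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        return log_coeff + i * math.log(p) + (n - i) * math.log1p(-p)
    return sum(math.exp(log_pmf(i)) for i in range(k, n + 1))

n_scenarios = 480          # size of the spot dataset, from the paper
n_agree = 192              # 40% of 480, the lower end of the reported range
p0 = 0.25                  # ASSUMPTION: hypothetical chance-agreement rate
p_value = binom_tail_ge(n_agree, n_scenarios, p0)
```

A per-scenario baseline (one over the number of legal moves in each spot) would be more faithful than a single `p0`, but it requires the scenario data and no longer yields a plain binomial distribution.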

Circularity Check

0 steps flagged

No significant circularity in LudoBench evaluation framework

The paper introduces handcrafted spot scenarios and an independent Expectiminimax game-theory baseline implemented in a contributed simulator; the reported 40-46% agreement rates, behavioral archetypes, and grudge-framing shifts are direct empirical measurements of LLM outputs against this external baseline. No self-citations, fitted parameters, or self-referential definitions appear in the derivation chain, and the results do not reduce by construction to the inputs via any of the enumerated patterns. The framework is self-contained as a benchmark comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmark contribution relying on standard Ludo game rules and established Expectiminimax search; no free parameters, axioms beyond domain standards, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5545 in / 1190 out tokens · 64506 ms · 2026-05-10T19:08:02.990493+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    Player 2 captured your piece last turn despite having a safer option

    doi: 10.1109/CIG.2012.6374142. Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018. Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. In Science, volume 365, pp. 885–890, 2019. Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep Blue. Artifi...

  2. [2]

    The heuristic always captures when possible

    Capture (+100): Any legal capture on a non-safe square dominates all other move types. The heuristic always captures when possible

  3. [3]

    Home path progress (52+): Entering or advancing in the home path is valued above all non-capture main-board moves

  4. [4]

    Leave base (50): Bringing a piece out of base is valued slightly below home entry but above most main-board advancement

  5. [5]

    Safe square (+20): Landing on a safe square provides a moderate defensive bonus, but does not override capture or home-entry incentives

  6. [6]

    Main-board advancement (0–51): Basic forward progress, with higher scores for positions further from start.

    G LLM Prompt Template

    The following is the complete prompt template provided to the LLM agent, reproduced from llm agent.py. Fields in [CAPS] are populated dynamically from the spot configuration. In spot evaluation mo...

  7. [7]

    Main circular board: 52 squares (0-51)

  8. [8]

    Each player has a fixed START square

  9. [9]

    Tokens move forward relative to START

  10. [10]

    After one full lap (52 steps), tokens enter that player's HOME PATH

  11. [11]

    Each player has a UNIQUE HOME PATH (>= 52)

  12. [12]

    Final home position is HOME_END

  13. [13]

    Tokens must land EXACTLY on HOME_END

  14. [14]

    Overshooting HOME_END is illegal.

    -------------------- TOKEN STATES --------------------
    - Position = -1 : Token is in base
    - Position 0-51 : Token is on main board
    - Position >= 52 : Token is in home path
    - HOME_END : Token has finished

    -------------------- GAME RULES --------------------

  15. [15]

    All tokens start in base (-1)

  16. [16]

    Leave base ONLY on dice = 6

  17. [17]

    Leaving base places token at START square

  18. [18]

    Tokens move forward by dice value

  19. [19]

    No stacking (one token per square)

  20. [20]

    CAPTURE: land on opponent on non-safe square -> opponent sent to base (-1)

  21. [21]

    Captures NEVER happen on safe squares

  22. [22]

    Safe squares protect tokens from capture

  23. [23]

    Rolling 6 grants an extra turn

  24. [24]

    No legal move -> turn skipped

  25. [25]

    First to move ALL tokens to HOME_END wins

  26. [26]

    id": "cvs_2p_001

    Home paths are private to each player.

    -------------------- CURRENT GAME STATE --------------------
    Number of players: [NUM_PLAYERS]
    Active player ids: [PLAYER_IDS]
    Dice rolled: [DICE]
    Your token positions (Player [PLAYER_ID]): [YOUR_TOKENS]
    Other players' token positions: [OTHER_PLAYERS_TOKENS]
    Your start square: [START_POS]
    You...
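
The token states and movement rules quoted in the prompt fragments above can be turned into a small legal-move check. This is a minimal Python sketch for one player's tokens only, tracked as steps travelled from that player's START square. The home-path length, the resulting `HOME_END` value, and the choice to exempt finished tokens from the no-stacking rule are assumptions, since the excerpt does not state them; captures and safe squares involve opponents and are left out.

```python
BASE = -1             # token in base, per the TOKEN STATES block
TRACK_LEN = 52        # main circular board, squares 0-51
HOME_PATH_LEN = 6     # ASSUMPTION: home-path length is not given in the excerpt
HOME_END = TRACK_LEN + HOME_PATH_LEN - 1  # ASSUMPTION: derived from the line above

def legal_moves(progress, dice):
    """List (token_index, new_progress) pairs that are legal for one player.

    progress: per-token steps from START; -1 means base, 0-51 main board,
    52..HOME_END home path (a relative encoding chosen for this sketch).
    """
    moves = []
    for i, p in enumerate(progress):
        if p == HOME_END:
            continue                      # token has already finished
        if p == BASE:
            if dice != 6:
                continue                  # leave base ONLY on dice = 6
            target = 0                    # leaving base places token at START
        else:
            target = p + dice
            if target > HOME_END:
                continue                  # overshooting HOME_END is illegal
        # No stacking with own tokens; finished tokens are exempt (assumption).
        if target != HOME_END and target in progress:
            continue
        moves.append((i, target))
    return moves

# Example: two tokens in base, one on the main board, one finished, dice = 6.
options = legal_moves([BASE, BASE, 10, HOME_END], 6)
```

An empty result corresponds to the "No legal move -> turn skipped" rule; an invalid LLM output would be any reply whose chosen token is not among the returned indices.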