LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Dhruv Kumar; Ojas Jain

arXiv: 2604.05681 · v1 · submitted 2026-04-07 · 💻 cs.AI · cs.CL · cs.GT · cs.LG · cs.MA

Pith reviewed 2026-05-10 19:08 UTC · model grok-4.3

keywords Ludo · LLM evaluation · strategic reasoning · board games · behavioral archetypes · prompt sensitivity · game theory baseline · stochastic planning

The pith

LLMs agree with game-theory optimal play in Ludo only 40-46% of the time and split into two incomplete behavioral archetypes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LudoBench, a set of 480 handcrafted board positions in the dice-driven game Ludo, each designed to test one narrow strategic decision such as whether to capture an opponent or advance toward home. A game-theory agent that performs limited lookahead search supplies a clear baseline for good play. Six LLMs from four families are tested on these positions and match the baseline only 40-46 percent of the time. The models fall into two consistent patterns: finishers that push pieces home but leave others undeveloped, and builders that move pieces forward but never complete them. The same models also change their choices on identical boards when the prompt includes a history of past captures, showing sensitivity to framing.

Core claim

Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability.

What carries the argument

480 handcrafted spot scenarios grouped into 12 behaviorally distinct decision categories, each isolating one strategic choice in Ludo while holding other variables fixed.
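
A minimal sketch of how move agreement with the game-theory (GT) agent could be tallied per decision category. The record fields (`category`, `model_move`, `gt_move`), the category name `finish_vs_develop`, and the sample values are hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict

def alignment_by_category(records):
    """Return {category: fraction of spots where the model move equals the GT move}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        if rec["model_move"] == rec["gt_move"]:
            hits[rec["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical records for illustration only.
records = [
    {"category": "capture_vs_safe", "model_move": 0, "gt_move": 0},
    {"category": "capture_vs_safe", "model_move": 1, "gt_move": 0},
    {"category": "finish_vs_develop", "model_move": 2, "gt_move": 2},
    {"category": "finish_vs_develop", "model_move": 2, "gt_move": 2},
]
scores = alignment_by_category(records)
# Overall agreement is the same comparison pooled across all categories.
overall = sum(r["model_move"] == r["gt_move"] for r in records) / len(records)
```

Reporting the per-category fractions alongside the pooled figure is what would let a reader see whether a model's disagreements concentrate in particular decision types.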

If this is right

LLM strategic reasoning in stochastic multi-agent settings covers only part of the optimal policy and leaves systematic gaps.
Prompts that include past opponent actions can shift decisions even when the current board state is unchanged.
The two observed archetypes suggest LLMs can be prompted or fine-tuned toward more balanced play by targeting the missing half of the strategy.
LudoBench offers a lightweight testbed that can track whether future models close the 40-46% agreement gap.
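
The prompt-sensitivity implication above can be quantified with a simple flip rate over paired spots. This is a sketch under assumptions: the function name `shift_rate` and the example decisions are hypothetical, and the paper may define its shift metric differently.

```python
def shift_rate(pairs):
    """Fraction of paired spots where the chosen move differs between framings."""
    if not pairs:
        raise ValueError("need at least one neutral/grudge pair")
    flips = sum(1 for neutral_move, grudge_move in pairs if neutral_move != grudge_move)
    return flips / len(pairs)

# Hypothetical decisions: (move under neutral history, move under grudge history).
# The board state is identical within each pair; only the history text changes.
example_pairs = [(0, 0), (1, 3), (2, 2), (0, 1)]
rate = shift_rate(example_pairs)
```

Because the board is held fixed within each pair, any nonzero rate is attributable to the framing rather than to the position.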

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar spot-based testing could expose comparable blind spots in other stochastic games or real-world planning tasks that involve uncertainty and opponents.
The prompt sensitivity finding implies that history-aware prompting may be needed for reliable multi-turn decision making.
Full-game win rates against the game-theory agent would reveal whether the spot-level archetypes compound into overall performance differences.

Load-bearing premise

The 480 spot scenarios isolate the intended strategic choices without interference from random dice outcomes or ongoing multi-player interactions.

What would settle it

Run the same six models on complete Ludo matches against the Expectiminimax agent and count how often their moves match the baseline across repeated trials of the same starting positions.

Figures

Figures reproduced from arXiv: 2604.05681 by Dhruv Kumar, Ojas Jain.

**Figure 2.** Agent Head-to-Head Win-Rate Matrix. Win rates over 200 games confirm the baseline skill ladder: Random < Heuristic < GT. GT’s 59% vs. Heuristic validates depth-limited search advantage.

**Figure 4.** GT Alignment Score (move agreement with GT agent) by model and category. Dark = high agreement; light = systematic disagreement. Rule compliance does not simply depend on model size. Gemma-3-12B-IT (10% invalid) performs worse than the much larger Qwen-Plus (4%), suggesting that training methodology matters as much as scale. All metrics are formally defined in ...

**Figure 5.** Archetype Clustering: Builder versus Finisher. Each model is plotted by development tendency (x-axis) versus completion tendency (y-axis). The GT agent (star) exhibits both tendencies. LLMs split into finishers (top-left) and builders (bottom-right).

**Figure 8.** Persona Alignment Heatmap. Each cell shows how strongly behavior shifts to match the persona (1.0 = perfect). Most scores fall between 0.3 and 0.5, indicating weak effects. Only Q7B-aggressive (0.93) and QP-greedy (0.83) show strong alignment.

**Figure 9.** Example spot entries. (a) A capture vs safe scenario: Player 1 (LLM) with dice=6 can either capture Player 0’s token at square 49 or move to safety. (b) A grudge pair: identical board state with neutral (a) and grudge (b) history framing.

Original abstract

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/
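
To make the baseline concrete, here is a minimal sketch of depth-limited Expectiminimax over an explicit toy tree. It is not the authors' implementation: the node layout, the static `value` fields, and the capture-versus-safe numbers are hypothetical, and the sketch uses a two-player max/min form rather than the paper's 4-player setting.

```python
from fractions import Fraction

def expectiminimax(node, depth, evaluate):
    """Depth-limited expectiminimax over an explicit game tree."""
    kind = node["kind"]
    if kind == "leaf" or depth == 0:
        return evaluate(node)  # static evaluation at terminals and at the cutoff
    if kind == "max":
        return max(expectiminimax(c, depth - 1, evaluate) for c in node["children"])
    if kind == "min":
        return min(expectiminimax(c, depth - 1, evaluate) for c in node["children"])
    if kind == "chance":
        # Chance nodes average child values weighted by outcome probability.
        return sum(p * expectiminimax(c, depth - 1, evaluate) for p, c in node["children"])
    raise ValueError(f"unknown node kind: {kind}")

def leaf(value):
    return {"kind": "leaf", "value": value}

# Hypothetical position: capturing looks worth 100 statically, but carries a
# 1-in-6 risk of a costly recapture; the safe move is worth a flat 20.
capture = {
    "kind": "chance",
    "value": 100,
    "children": [(Fraction(5, 6), leaf(100)), (Fraction(1, 6), leaf(-50))],
}
safe = leaf(20)
root = {"kind": "max", "value": 0, "children": [capture, safe]}

def static_value(node):
    return node["value"]

full = expectiminimax(root, 2, static_value)     # looks through the chance node
shallow = expectiminimax(root, 1, static_value)  # stops at the static estimate
```

In this toy tree the deeper search values the capture at 75 rather than the static 100, which illustrates why the lookahead depth matters when reproducing the baseline.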

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LudoBench adds a new game benchmark with 480 scenarios and an Expectiminimax baseline, showing low model agreement and behavioral archetypes, but the handcrafted spots likely do not cleanly isolate strategies from dice expectations and opponent effects.

The main things to know about this paper are that it introduces LudoBench, a new set of 480 spot scenarios for Ludo to test LLM strategic choices, along with a simulator and an Expectiminimax agent, and reports that models only agree with the baseline 40-46% of the time while showing distinct archetypes and prompt sensitivity. It also splits models into finishers who complete pieces but skip development and builders who develop but never finish, plus some shifts under grudge framing on identical states. The simulator supports random, heuristic, game-theory, and LLM agents, and everything is released with code and data.

This is genuinely new material for LLM evaluation in stochastic multi-agent settings. The open resources and the concrete archetype observations stand out as useful additions that prior game benchmarks have not emphasized in the same way. The grudge test on fixed boards is a straightforward way to surface prompt sensitivity in a planning context. Having a depth-limited Expectiminimax baseline gives a clearer ceiling than pure heuristics.

The soft spots center on whether the 480 scenarios actually isolate the 12 intended decision categories. Ludo moves are expectations over dice distributions and require some model of other players, so handcrafting spots without explicit controls for those factors could let confounds leak into the agreement rates and archetype splits. The abstract gives little on scenario validation, exact prompting, or any statistical checks, which leaves the claims harder to assess. If the full paper has more on how they ensured each spot's optimal action is unambiguous, that would help.

This is for researchers building or using benchmarks for LLM planning under uncertainty, especially those who want interpretable game scenarios rather than full-game rollouts. A reader focused on evaluation methods would get value from the setup and the released assets.

It has enough new elements and grounding to deserve a serious referee, though it would benefit from tighter methodology details. I recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo consisting of 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, together with a 4-player simulator supporting Random, Heuristic, Expectiminimax (game-theory), and LLM agents. Evaluation of six models from four families shows 40-46% agreement with the Expectiminimax baseline; models divide into 'finisher' and 'builder' archetypes that each capture only half the baseline strategy; and models exhibit measurable shifts under history-conditioned grudge framing on identical board states. Code, the spot dataset, and model outputs are released.

Significance. If the handcrafted scenarios validly isolate the targeted strategic choices without confounding from dice expectations or opponent modeling, the work would demonstrate concrete limitations in current LLMs for stochastic multi-agent planning, including incomplete strategy capture and prompt sensitivity. The released simulator and dataset constitute a reproducible, lightweight framework that could support future controlled studies of LLM decision-making under uncertainty.

major comments (2)

[§3] §3 (LudoBench construction) and abstract: the central claim that each of the 480 scenarios 'isolates a specific strategic choice' is load-bearing for the reported 40-46% agreement rates, archetype split, and grudge-framing results, yet the manuscript provides no explicit validation that move values have been computed as expectations over the dice distribution or under consistent opponent-policy assumptions. In Ludo, even fixed board states require such expectations; without this, the measured divergences could be artifacts of implicit single-agent or deterministic assumptions embedded in the handcrafting.
[Results] Results section: the paper reports archetype classifications and behavioral shifts but contains no statistical tests for the significance of the 40-46% agreement figures, no per-category error breakdowns, and no description of the exact prompting templates or temperature settings used for the LLM agents. These omissions prevent assessment of whether the archetype and sensitivity claims are robustly supported by the data.

minor comments (2)

[Simulator] The Expectiminimax implementation description would benefit from stating the exact search depth and any move-ordering or pruning used, to aid exact reproduction of the baseline.
[§3] An illustrative example of one spot scenario from each of the 12 categories in the main text (rather than only in the released dataset) would improve interpretability of the decision categories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.

Referee: [§3] §3 (LudoBench construction) and abstract: the central claim that each of the 480 scenarios 'isolates a specific strategic choice' is load-bearing for the reported 40-46% agreement rates, archetype split, and grudge-framing results, yet the manuscript provides no explicit validation that move values have been computed as expectations over the dice distribution or under consistent opponent-policy assumptions. In Ludo, even fixed board states require such expectations; without this, the measured divergences could be artifacts of implicit single-agent or deterministic assumptions embedded in the handcrafting.

Authors: We agree that the manuscript would benefit from greater explicitness on this point. The 480 scenarios were constructed by first running the Expectiminimax agent (which computes full expectations over the dice distribution at each node and assumes consistent opponent policies via the shared simulator) on candidate board states, then selecting states where the optimal move differs from heuristic baselines in a targeted behavioral dimension. To address the concern directly, we will expand §3 with a new subsection describing the Expectiminimax implementation, including the stochastic rollout procedure for dice expectations and the fixed opponent-policy assumptions. We will also add one worked example per decision category showing the board state, the dice distribution, the computed move values, and why the chosen move isolates the intended choice. These additions will make the isolation claim verifiable without altering the reported results.

revision: yes

Referee: [Results] Results section: the paper reports archetype classifications and behavioral shifts but contains no statistical tests for the significance of the 40-46% agreement figures, no per-category error breakdowns, and no description of the exact prompting templates or temperature settings used for the LLM agents. These omissions prevent assessment of whether the archetype and sensitivity claims are robustly supported by the data.

Authors: We accept that these details were insufficiently reported. In the revised manuscript we will add (1) binomial or chi-squared statistical tests with p-values for the overall 40-46% agreement rates against a random baseline, (2) a supplementary table breaking down agreement and error types by the 12 decision categories to support the archetype analysis, and (3) the precise prompting templates (including the history-conditioned grudge framing) together with temperature settings (0.0 for all models to ensure reproducibility) in a new appendix. These changes require only additional text and tables; no new experiments are needed.

revision: yes
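
As an illustration of the binomial test mentioned in this response, the sketch below computes an exact one-sided tail probability using only the standard library. The agreement count (206 of 480, roughly 43%) and the 0.25 chance baseline (uniform choice among four tokens) are hypothetical placeholders, not figures from the paper.

```python
from math import comb

def binomial_upper_tail(successes, trials, p0):
    """Exact P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical inputs: 206 agreements out of 480 spots, tested against an
# assumed random-move baseline that matches the GT move 25% of the time.
agreements, spots, chance = 206, 480, 0.25
p_value = binomial_upper_tail(agreements, spots, chance)
```

The appropriate chance level depends on how many legal moves each spot offers, so a per-spot baseline would be more faithful than a single pooled value.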

Circularity Check

0 steps flagged

No significant circularity in LudoBench evaluation framework

The paper introduces handcrafted spot scenarios and an independent Expectiminimax game-theory baseline implemented in a contributed simulator; the reported 40-46% agreement rates, behavioral archetypes, and grudge-framing shifts are direct empirical measurements of LLM outputs against this external baseline. No self-citations, fitted parameters, or self-referential definitions appear in the derivation chain, and the results do not reduce by construction to the inputs via any of the enumerated patterns. The framework is self-contained as a benchmark comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmark contribution relying on standard Ludo game rules and established Expectiminimax search; no free parameters, axioms beyond domain standards, or invented entities are introduced.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] Player 2 captured your piece last turn despite having a safer option

doi: 10.1109/CIG.2012.6374142. Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018. Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. In Science, volume 365, pp. 885–890, 2019. Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep Blue. Artifi...

[2] Capture (+100): Any legal capture on a non-safe square dominates all other move types. The heuristic always captures when possible

[3] Home path progress (52+): Entering or advancing in the home path is valued above all non-capture main-board moves

[4] Leave base (50): Bringing a piece out of base is valued slightly below home entry but above most main-board advancement

[5] Safe square (+20): Landing on a safe square provides a moderate defensive bonus, but does not override capture or home-entry incentives

[6] Main-board advancement (0–51): Basic forward progress, with higher scores for positions further from start.

G LLM Prompt Template: The following is the complete prompt template provided to the LLM agent, reproduced from llm agent.py. Fields in [CAPS] are populated dynamically from the spot configuration. In spot evaluation mo...
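
The heuristic priorities listed in entries [2] to [6] can be sketched as a scoring function. The excerpt gives the weights but not how they combine, so the additive combination below, the move fields, and the example candidates are all assumptions made for illustration; an additive rule can let the safe-square bonus outrank home entry in some positions, which the excerpt says the real heuristic avoids.

```python
def heuristic_score(move):
    """Score a candidate move using the weights quoted in entries [2]-[6]."""
    if move["leaves_base"]:
        score = 50                       # leave base (50)
    elif move["home_steps"] > 0:
        score = 52 + move["home_steps"]  # home path progress (52+)
    else:
        score = move["progress"]         # main-board advancement (0-51)
    if move["captures"]:
        score += 100                     # capture (+100)
    if move["lands_safe"]:
        score += 20                      # safe square (+20)
    return score

def make_move(name, progress=0, home_steps=0, leaves_base=False,
              captures=False, lands_safe=False):
    """Build a hypothetical move record for the scorer above."""
    return {"name": name, "progress": progress, "home_steps": home_steps,
            "leaves_base": leaves_base, "captures": captures,
            "lands_safe": lands_safe}

# Hypothetical candidates available on one turn.
candidates = [
    make_move("advance", progress=30),
    make_move("leave_base", leaves_base=True),
    make_move("enter_home", home_steps=1),
    make_move("capture", progress=12, captures=True),
]
ranked = sorted(candidates, key=heuristic_score, reverse=True)
```

With these candidates the capture ranks first, followed by home entry, leaving base, and plain advancement, which matches the ordering the entries describe.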
[7] Main circular board: 52 squares (0-51)

[8] Each player has a fixed START square

[9] Tokens move forward relative to START

[10] After one full lap (52 steps), tokens enter that player's HOME PATH

[11] Each player has a UNIQUE HOME PATH (>= 52)

[12] Final home position is HOME_END

[13] Tokens must land EXACTLY on HOME_END

[14] Overshooting HOME_END is illegal.

-------------------- TOKEN STATES --------------------
- Position = -1 : Token is in base
- Position 0-51 : Token is on main board
- Position >= 52 : Token is in home path
- HOME_END : Token has finished

-------------------- GAME RULES --------------------

[15] All tokens start in base (-1)

[16] Leave base ONLY on dice = 6

[17] Leaving base places token at START square

[18] Tokens move forward by dice value

[19] No stacking (one token per square)

[20] CAPTURE: land on opponent on non-safe square -> opponent sent to base (-1)

[21] Captures NEVER happen on safe squares

[22] Safe squares protect tokens from capture

[23] Rolling 6 grants an extra turn

[24] No legal move -> turn skipped

[25] First to move ALL tokens to HOME_END wins

[26] id": "cvs_2p_001

Home paths are private to each player.

-------------------- CURRENT GAME STATE --------------------
Number of players: [NUM_PLAYERS]
Active player ids: [PLAYER_IDS]
Dice rolled: [DICE]
Your token positions (Player [PLAYER_ID]): [YOUR_TOKENS]
Other players' token positions: [OTHER_PLAYERS_TOKENS]
Your start square: [START_POS]
You...
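
The movement rules quoted in entries [7] to [25] can be sketched as a legal-move generator for one player's tokens. This is a simplification under assumptions: positions are tracked as steps travelled from START rather than absolute squares, `HOME_END = 57` assumes a six-square home path (the excerpt does not give the value), and only the player's own tokens are checked for stacking.

```python
HOME_END = 57  # assumption: 52 main-board steps plus a six-square home path
BASE = -1      # token state from the prompt template: -1 means "in base"

def legal_moves(tokens, dice):
    """Return [(token_index, new_position)] for one player's tokens.

    tokens holds each token's progress: BASE, 0..51 on the main board,
    52..HOME_END-1 in the home path, or HOME_END when finished.
    """
    moves = []
    for i, pos in enumerate(tokens):
        if pos == HOME_END:
            continue                 # finished tokens cannot move
        if pos == BASE:
            if dice != 6:
                continue             # leave base ONLY on dice = 6
            new = 0                  # leaving base places the token at START
        else:
            new = pos + dice
            if new > HOME_END:
                continue             # overshooting HOME_END is illegal
        # No stacking among own tokens; HOME_END can hold any number.
        blocked = new != HOME_END and any(
            j != i and other == new for j, other in enumerate(tokens)
        )
        if not blocked:
            moves.append((i, new))
    return moves

on_six = legal_moves([BASE, 10, 55, HOME_END], 6)
on_two = legal_moves([BASE, 10, 55, HOME_END], 2)
stuck = legal_moves([BASE, BASE, BASE, BASE], 3)
```

An empty result corresponds to rule [24], where the turn is skipped. Captures and safe squares need the opponents' positions and are left out of this sketch.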
