arXiv: 2605.22095 · v1 · pith:GEMIXPY3new · submitted 2026-05-21 · 💰 econ.GN · cs.AI · cs.GT · cs.HC · q-fin.EC

Not Yet: Humans Outperform LLMs in a Colonel Blotto Tournament

Pith reviewed 2026-05-22 02:35 UTC · model grok-4.3

classification 💰 econ.GN · cs.AI · cs.GT · cs.HC · q-fin.EC
keywords Colonel Blotto · large language models · strategic behavior · tournaments · human-AI comparison · resource allocation · heuristics

The pith

Humans outperform LLMs in Colonel Blotto tournaments by using better-calibrated intermediate allocation heuristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes round-robin tournaments in the Colonel Blotto game, first among over 200 humans, then with several popular LLMs submitting strategies, and finally a direct comparison with matched numbers of entries. Humans win more often because they apply allocation rules that balance resources across battlefields in ways better matched to the game's structure, while LLMs more often repeat simpler or fixed patterns. The results matter because the game has a high-dimensional action space and lacks pure strategy Nash equilibria, so it tests whether models can handle open-ended strategic choices. If the pattern holds, current LLMs have not reached human performance levels in this type of interaction even when both sides receive the same rules.

Core claim

Humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication pays off only at the right depth of reasoning: shallower and deeper levels of reasoning offer no clear advantage over primitive strategies. Among humans, field of study weakly predicts success, with STEM backgrounds performing better. Humans barely adjust their strategies across tournaments with different sets of opponents.

What carries the argument

The round-robin tournament format in the Colonel Blotto game, in which each player divides a fixed total resource across multiple independent battlefields and wins a battlefield by assigning strictly more than the opponent.
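
The scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names `duel` and `round_robin`, the five-battlefield example, and the budget of 100 are hypothetical, and the tie handling (half a battlefield each on equal allocations, half a point each on an equal battlefield count) is an assumption made here.

```python
from itertools import combinations

def duel(a, b):
    """Return (points_a, points_b) for one match between allocations a and b."""
    if len(a) != len(b):
        raise ValueError("allocations must cover the same battlefields")
    # A battlefield goes to the strictly larger allocation; equal allocations split it.
    fields_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x, y in zip(a, b))
    fields_b = len(a) - fields_a
    if fields_a > fields_b:
        return 1.0, 0.0
    if fields_a < fields_b:
        return 0.0, 1.0
    return 0.5, 0.5

def round_robin(entries):
    """Play every pair of entries once and return total points per entry."""
    scores = {name: 0.0 for name in entries}
    for p, q in combinations(entries, 2):
        points_p, points_q = duel(entries[p], entries[q])
        scores[p] += points_p
        scores[q] += points_q
    return scores

# Hypothetical entries: three ways to split 100 units over 5 battlefields.
entries = {
    "uniform": [20, 20, 20, 20, 20],
    "focused": [34, 33, 33, 0, 0],
    "mixed": [25, 25, 25, 25, 0],
}
scores = round_robin(entries)
print(scores)
```

In this toy field the uniform split loses both of its matches, which illustrates why an entry's ranking depends on the whole population of submitted strategies rather than on any single opponent.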

If this is right

  • Strategic sophistication improves outcomes only when it reaches an intermediate depth of reasoning.
  • Humans base their allocations mainly on the game's rules rather than on whether opponents are humans or LLMs.
  • STEM background gives a small edge in human-only play but does not dominate overall results.
  • LLM strategies remain more repetitive and less responsive to the absence of a pure equilibrium.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap may shrink if LLMs receive training signals that reward intermediate calibration rather than surface-level patterns.
  • The same tournament design could be applied to other multi-battle resource games to test whether the human advantage generalizes.
  • Prompt engineering alone might not close the difference if the core limitation is in how models sample from high-dimensional strategy spaces.

Load-bearing premise

The strategies that LLMs submitted under the fixed prompts used in the second and third tournaments fairly represent what current models can produce, and the matching isolates the effect of strategy quality itself.

What would settle it

Running the same tournament format with newer LLMs or with prompts that include successful human examples and finding that the performance gap disappears would show the claim does not hold under those conditions.

Figures

Figures reproduced from arXiv: 2605.22095 by Alexey Savvateev, Dmitry Dagaev, Egor Ivanov, Gleb Vasiliev, Petr Parshakov.

Figure 1. Age distribution of the participants. The majority of respondents are relatively young, with a concentration in the early twenties. The mean age is 28.3 years (SD = 11.3), and the median age is 23 years. Participants’ ages range from 10 to 61 years (N = 215). The distribution is moderately right-skewed due to the presence of several older respondents.
Figure 2. Survival rate: Human (Tournament 1).
Figure 3. Survival rate: Human vs LLM (Tournament 2).
Figure 4. Survival rate: Human vs LLM (Tournament 3).
Original abstract

The emergence of large language models (LLMs) has spurred economists to study how humans and LLMs behave in strategic settings. We organized a series of round-robin tournaments in the Colonel Blotto game. This game attracts game theorists' attention due to high-dimensional action space and the absence of pure strategy Nash equilibria. In the first tournament, more than 200 human participants competed against one another. In the second tournament, several popular LLMs were invited to submit strategies. In the third tournament, we matched the number of LLM strategies to the number submitted by humans. We find that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication is key to success if and only if the necessary level of reasoning depth is reached, while lower and higher levels of reasoning offer no clear advantage over the primitive strategies. Among humans, field of study weakly predicts success: participants with STEM backgrounds perform better in the first tournament. Surprisingly, humans almost do not adjust their strategies across tournaments with different sets of opponents. This result suggests that humans base their choices primarily on the game's rules rather than on the identity of their opponents, treating LLMs much like human competitors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from three round-robin Colonel Blotto tournaments. Tournament 1 involves over 200 human participants. Tournament 2 invites several popular LLMs to submit strategies. Tournament 3 matches the number of LLM strategies to the human submissions. The central claim is that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication confers an advantage only when the necessary reasoning depth is reached; lower and higher levels offer no clear benefit. Among humans, STEM background weakly predicts success in Tournament 1. Humans show almost no adjustment of strategies across tournaments with different opponent sets, suggesting choices are driven primarily by game rules rather than opponent identity.

Significance. If the empirical patterns hold after methodological clarification, the work provides a useful data point in the literature comparing human and LLM strategic behavior in high-dimensional games without pure-strategy Nash equilibria. The observation that humans rely on intermediate heuristics while LLMs produce more stereotyped allocations, together with the lack of human adaptation to LLM opponents, could inform both behavioral game theory and assessments of current LLM reasoning limits. The paper's strength lies in its direct tournament design; its contribution would be strengthened by reproducible elicitation protocols.

major comments (2)
  1. [Methods (LLM strategy elicitation)] Methods section on LLM tournaments (Tournaments 2 and 3): the manuscript provides no details on the exact prompts used to elicit strategies from each LLM, the number of strategies generated per model, whether single-shot or multi-sample generation was employed, or any selection criteria applied to the submitted allocations. This information is load-bearing for the central claim that LLMs produce simpler, stereotyped strategies; without it, the performance gap could reflect elicitation constraints rather than inherent model limitations, directly affecting the interpretation that humans outperform due to better-calibrated heuristics.
  2. [Results (performance and heuristic analysis)] Results on performance comparisons: the abstract and reported findings describe qualitative patterns of human outperformance and heuristic differences but do not report sample sizes for LLM submissions in Tournament 2, the exact statistical tests or effect sizes used to establish superiority, or adjustments for multiple comparisons. Given the typically small number of distinct LLMs available, these omissions make it impossible to assess whether the observed differences are statistically reliable or sensitive to small-sample artifacts.
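
One possible form for the statistical comparison requested in major comment 2 is a permutation test on mean tournament scores, sketched below. This is an assumption about what such a test could look like, not the test used in the paper: the function name `permutation_test`, its parameters, and the score values are all hypothetical. Scores from a round robin are not independent, since every entry plays every other, so a result from this sketch would be a rough check rather than a definitive one.

```python
import random

def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in mean score."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the gap in means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-entry win rates, for illustration only.
human_scores = [0.90, 0.80, 0.85, 0.95, 0.90]
llm_scores = [0.10, 0.20, 0.15, 0.05, 0.10]
p_value = permutation_test(human_scores, llm_scores)
print(p_value)
```

With only a handful of distinct LLM entries, the number of distinct label assignments is small, which bounds how small the p-value can get; that is the small-sample concern the referee raises.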
minor comments (2)
  1. [Abstract] The abstract states that 'several popular LLMs' were used but does not name the specific models or versions; this detail should be added for reproducibility.
  2. [Discussion] The claim that humans 'almost do not adjust their strategies' would be strengthened by a quantitative measure (e.g., average Euclidean distance or correlation between allocation vectors across tournaments) rather than a qualitative statement.
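
The quantitative measure suggested in minor comment 2 could be computed along the following lines. This is a minimal sketch under stated assumptions: the function names, the dictionary layout (participant ID mapped to an allocation vector), and the example allocations are hypothetical and do not come from the paper.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two allocation vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation between two allocations, or None if one is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # correlation is undefined for a perfectly uniform allocation
    return cov / math.sqrt(var_a * var_b)

def mean_adjustment(first, second):
    """Average Euclidean distance between each participant's two allocations."""
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no participant appears in both tournaments")
    return sum(euclidean_distance(first[p], second[p]) for p in shared) / len(shared)

# Hypothetical allocations by two participants in two tournaments.
tournament_a = {"p1": [30, 30, 40, 0], "p2": [25, 25, 25, 25]}
tournament_b = {"p1": [30, 30, 40, 0], "p2": [28, 22, 25, 25]}
print(mean_adjustment(tournament_a, tournament_b))
```

A mean distance close to zero would support the claim that participants resubmit nearly the same allocation. The correlation helper returns `None` for uniform allocations, which matters here because an even split is a plausible submission.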

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. These suggestions will help enhance the transparency and robustness of our analysis. Below, we provide point-by-point responses to the major comments.

  1. Referee: Methods section on LLM tournaments (Tournaments 2 and 3): the manuscript provides no details on the exact prompts used to elicit strategies from each LLM, the number of strategies generated per model, whether single-shot or multi-sample generation was employed, or any selection criteria applied to the submitted allocations. This information is load-bearing for the central claim that LLMs produce simpler, stereotyped strategies; without it, the performance gap could reflect elicitation constraints rather than inherent model limitations, directly affecting the interpretation that humans outperform due to better-calibrated heuristics.

    Authors: We agree with the referee that providing comprehensive details on the LLM strategy elicitation process is crucial for the validity and interpretability of our findings. In the revised version of the manuscript, we will include a detailed description of the prompts used for each LLM, specify the number of strategies generated per model, clarify whether single-shot or multi-sample generation was employed, and outline any selection criteria applied to the submitted allocations. This addition will allow for a clearer assessment of whether the performance differences are due to inherent model limitations or the specific elicitation methods used. revision: yes

  2. Referee: Results on performance comparisons: the abstract and reported findings describe qualitative patterns of human outperformance and heuristic differences but do not report sample sizes for LLM submissions in Tournament 2, the exact statistical tests or effect sizes used to establish superiority, or adjustments for multiple comparisons. Given the typically small number of distinct LLMs available, these omissions make it impossible to assess whether the observed differences are statistically reliable or sensitive to small-sample artifacts.

    Authors: We acknowledge that greater statistical detail is necessary to substantiate the performance comparisons. In the revised manuscript, we will explicitly report the sample sizes for LLM submissions in Tournament 2, describe the exact statistical tests and effect sizes used to compare human and LLM performance, and indicate any adjustments made for multiple comparisons. These additions will help evaluate the reliability of the observed differences, particularly in light of the limited number of LLMs typically available. revision: yes

Circularity Check

0 steps flagged

Purely empirical tournament study with no derivation chain

The paper reports outcomes from three round-robin Colonel Blotto tournaments (one human, two LLM) and compares submitted strategies and win rates. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. All findings rest on direct empirical observation of allocation heuristics and performance gaps. No self-citation load-bearing steps, self-definitional constructs, or renamings of known results are present. The central claim that humans employ better-calibrated intermediate heuristics is a descriptive summary of the data, not a reduction to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical comparison with no theoretical derivation; it relies on the standard definition of the Colonel Blotto game and the assumption that submitted strategies reflect model capabilities under fixed prompts.

axioms (1)
  • domain assumption: the Colonel Blotto game has a high-dimensional action space and no pure-strategy Nash equilibria
    Invoked in the abstract as background for why the game is interesting to game theorists.
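
One standard way to see why no fixed allocation can be an equilibrium when the match goes to whoever wins a majority of battlefields is that any such allocation can be beaten: give up the opponent's strongest battlefield and spend those resources outbidding them by one everywhere else. The sketch below illustrates this under assumptions made here: three or more battlefields, integer allocations, and an opponent whose largest allocation is at least the number of battlefields minus one (guaranteed, for example, with 9 battlefields and a budget of 100). The function name `beating_response` is hypothetical.

```python
def beating_response(opponent):
    """Build an allocation with the same budget that wins all battlefields but one."""
    n = len(opponent)
    if n < 2:
        raise ValueError("need at least two battlefields")
    budget = sum(opponent)
    top = max(range(n), key=lambda i: opponent[i])
    if opponent[top] < n - 1:
        raise ValueError("largest allocation is too small for this construction")
    # Outbid the opponent by one everywhere except their strongest battlefield.
    response = [x + 1 for x in opponent]
    response[top] = 0
    # Park any unspent resources on a battlefield that is already being won.
    spare = budget - sum(response)
    other = 0 if top != 0 else 1
    response[other] += spare
    return response

# Illustrative near-uniform allocation of 100 units over 9 battlefields.
opponent = [12, 11, 11, 11, 11, 11, 11, 11, 11]
response = beating_response(opponent)
wins = sum(r > o for r, o in zip(response, opponent))
print(response, wins)
```

Because the response can itself be beaten by the same construction, best replies cycle, which is consistent with the domain assumption above that the game has no pure-strategy equilibrium.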

pith-pipeline@v0.9.0 · 5776 in / 1240 out tokens · 43608 ms · 2026-05-22T02:35:46.324882+00:00 · methodology
