Not Yet: Humans Outperform LLMs in a Colonel Blotto Tournament
Pith reviewed 2026-05-22 02:35 UTC · model grok-4.3
The pith
Humans outperform LLMs in Colonel Blotto tournaments by using better-calibrated intermediate allocation heuristics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication is key to success if and only if the necessary level of reasoning depth is reached, while lower and higher levels of reasoning offer no clear advantage over the primitive strategies. Among humans, field of study weakly predicts success, with STEM backgrounds performing better. Humans barely adjust their strategies across tournaments with different sets of opponents.
What carries the argument
The round-robin tournament format in the Colonel Blotto game, in which each player divides a fixed total resource across multiple independent battlefields and wins a battlefield by assigning strictly more than the opponent.
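The scoring behind this format can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the tie handling (0.5 per tied battlefield, 0.5 points per tied match) follows the participant instructions quoted further down this page, and the three sample allocations are invented for the example.

```python
from itertools import combinations

def battlefield_votes(a, b):
    """Return (votes_a, votes_b); a tied battlefield gives 0.5 to each side."""
    if len(a) != len(b):
        raise ValueError("allocations must cover the same battlefields")
    votes_a = votes_b = 0.0
    for x, y in zip(a, b):
        if x > y:
            votes_a += 1.0
        elif y > x:
            votes_b += 1.0
        else:
            votes_a += 0.5
            votes_b += 0.5
    return votes_a, votes_b

def match_points(a, b):
    """Return (points_a, points_b): 1 for the match winner, 0.5 each on a tie."""
    votes_a, votes_b = battlefield_votes(a, b)
    if votes_a > votes_b:
        return 1.0, 0.0
    if votes_b > votes_a:
        return 0.0, 1.0
    return 0.5, 0.5

def round_robin(strategies):
    """Play every pair once; strategies maps a name to an allocation."""
    totals = {name: 0.0 for name in strategies}
    for p, q in combinations(strategies, 2):
        points_p, points_q = match_points(strategies[p], strategies[q])
        totals[p] += points_p
        totals[q] += points_q
    return totals

# Invented allocations of 100 units over 9 battlefields (illustration only).
sample = {
    "uniform": [12, 11, 11, 11, 11, 11, 11, 11, 11],
    "focus5": [20, 20, 20, 20, 20, 0, 0, 0, 0],
    "focus6": [17, 17, 17, 17, 16, 16, 0, 0, 0],
}
totals = round_robin(sample)
```

In this toy field the near-uniform split loses to both concentrated allocations, which illustrates why the choice of how many battlefields to contest matters for the final ranking.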
If this is right
- Strategic sophistication improves outcomes only when it reaches an intermediate depth of reasoning.
- Humans base their allocations mainly on the game's rules rather than on whether opponents are humans or LLMs.
- STEM background gives a small edge in human-only play but does not dominate overall results.
- LLM strategies remain more repetitive and less responsive to the absence of a pure equilibrium.
Where Pith is reading between the lines
- The gap may shrink if LLMs receive training signals that reward intermediate calibration rather than surface-level patterns.
- The same tournament design could be applied to other multi-battle resource games to test whether the human advantage generalizes.
- Prompt engineering alone might not close the difference if the core limitation is in how models sample from high-dimensional strategy spaces.
Load-bearing premise
The strategies that LLMs submitted under the fixed prompts used in the second and third tournaments fairly represent what current models can produce, and the matching isolates the effect of strategy quality itself.
What would settle it
Running the same tournament format with newer LLMs or with prompts that include successful human examples and finding that the performance gap disappears would show the claim does not hold under those conditions.
Original abstract
The emergence of large language models (LLMs) has spurred economists to study how humans and LLMs behave in strategic settings. We organized a series of round-robin tournaments in the Colonel Blotto game. This game attracts game theorists' attention due to high-dimensional action space and the absence of pure strategy Nash equilibria. In the first tournament, more than 200 human participants competed against one another. In the second tournament, several popular LLMs were invited to submit strategies. In the third tournament, we matched the number of LLM strategies to the number submitted by humans. We find that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication is key to success if and only if the necessary level of reasoning depth is reached, while lower and higher levels of reasoning offer no clear advantage over the primitive strategies. Among humans, field of study weakly predicts success: participants with STEM backgrounds perform better in the first tournament. Surprisingly, humans almost do not adjust their strategies across tournaments with different sets of opponents. This result suggests that humans base their choices primarily on the game's rules rather than on the identity of their opponents, treating LLMs much like human competitors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from three round-robin Colonel Blotto tournaments. Tournament 1 involves over 200 human participants. Tournament 2 invites several popular LLMs to submit strategies. Tournament 3 matches the number of LLM strategies to the human submissions. The central claim is that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication confers an advantage only when the necessary reasoning depth is reached; lower and higher levels offer no clear benefit. Among humans, STEM background weakly predicts success in Tournament 1. Humans show almost no adjustment of strategies across tournaments with different opponent sets, suggesting choices are driven primarily by game rules rather than opponent identity.
Significance. If the empirical patterns hold after methodological clarification, the work provides a useful data point in the literature comparing human and LLM strategic behavior in high-dimensional games without pure-strategy Nash equilibria. The observation that humans rely on intermediate heuristics while LLMs produce more stereotyped allocations, together with the lack of human adaptation to LLM opponents, could inform both behavioral game theory and assessments of current LLM reasoning limits. The paper's strength lies in its direct tournament design; its contribution would be strengthened by reproducible elicitation protocols.
major comments (2)
- [Methods (LLM strategy elicitation)] Methods section on LLM tournaments (Tournaments 2 and 3): the manuscript provides no details on the exact prompts used to elicit strategies from each LLM, the number of strategies generated per model, whether single-shot or multi-sample generation was employed, or any selection criteria applied to the submitted allocations. This information is load-bearing for the central claim that LLMs produce simpler, stereotyped strategies; without it, the performance gap could reflect elicitation constraints rather than inherent model limitations, directly affecting the interpretation that humans outperform due to better-calibrated heuristics.
- [Results (performance and heuristic analysis)] Results on performance comparisons: the abstract and reported findings describe qualitative patterns of human outperformance and heuristic differences but do not report sample sizes for LLM submissions in Tournament 2, the exact statistical tests or effect sizes used to establish superiority, or adjustments for multiple comparisons. Given the typically small number of distinct LLMs available, these omissions make it impossible to assess whether the observed differences are statistically reliable or sensitive to small-sample artifacts.
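One way the requested statistical detail could be supplied is a two-sided permutation test on mean tournament scores. The sketch below is an assumption about what such a test might look like, not the authors' analysis: the manuscript does not say which test was used, the function name and sample scores are hypothetical, and round-robin scores are not independent across players, so the resulting p-value is only illustrative.

```python
import random

def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns (observed_difference, p_value). Group labels are shuffled
    n_resamples times with a seeded generator for reproducibility.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        left, right = values[:n_a], values[n_a:]
        return sum(left) / len(left) - sum(right) / len(right)

    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(mean_diff(pooled)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical average points per match for two small groups.
human_scores = [0.60, 0.70, 0.65, 0.80]
llm_scores = [0.40, 0.50, 0.45, 0.30]
observed, p_value = permutation_test(human_scores, llm_scores)
```

With only a handful of distinct LLMs, the number of possible relabelings is small, which bounds how low the p-value can go; this is the small-sample concern raised in the comment above.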
minor comments (2)
- [Abstract] The abstract states that 'several popular LLMs' were used but does not name the specific models or versions; this detail should be added for reproducibility.
- [Discussion] The claim that humans 'almost do not adjust their strategies' would be strengthened by a quantitative measure (e.g., average Euclidean distance or correlation between allocation vectors across tournaments) rather than a qualitative statement.
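The quantitative measure suggested in the last comment could be computed as follows. This is a minimal sketch with hypothetical function names and made-up allocation vectors; it assumes each participant's allocations in two tournaments are available as equal-length lists.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two allocation vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("allocations must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation, or None when either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # undefined for a perfectly uniform allocation
    return cov / math.sqrt(var_a * var_b)

def mean_adjustment(pairs):
    """Average distance over (allocation_in_T1, allocation_in_T2) pairs."""
    distances = [euclidean_distance(a, b) for a, b in pairs]
    return sum(distances) / len(distances)

# Made-up example: one participant who changes nothing, one who shifts units.
pairs = [
    ([20, 20, 20, 20, 20, 0, 0, 0, 0], [20, 20, 20, 20, 20, 0, 0, 0, 0]),
    ([3, 0], [0, 4]),
]
average_shift = mean_adjustment(pairs)
```

A mean distance near zero across participants would back the "almost no adjustment" claim with a number; the correlation is left undefined for uniform allocations rather than forced to a value.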
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. These suggestions will help enhance the transparency and robustness of our analysis. Below, we provide point-by-point responses to the major comments.
-
Referee: Methods section on LLM tournaments (Tournaments 2 and 3): the manuscript provides no details on the exact prompts used to elicit strategies from each LLM, the number of strategies generated per model, whether single-shot or multi-sample generation was employed, or any selection criteria applied to the submitted allocations. This information is load-bearing for the central claim that LLMs produce simpler, stereotyped strategies; without it, the performance gap could reflect elicitation constraints rather than inherent model limitations, directly affecting the interpretation that humans outperform due to better-calibrated heuristics.
Authors: We agree with the referee that providing comprehensive details on the LLM strategy elicitation process is crucial for the validity and interpretability of our findings. In the revised version of the manuscript, we will include a detailed description of the prompts used for each LLM, specify the number of strategies generated per model, clarify whether single-shot or multi-sample generation was employed, and outline any selection criteria applied to the submitted allocations. This addition will allow for a clearer assessment of whether the performance differences are due to inherent model limitations or the specific elicitation methods used. revision: yes
-
Referee: Results on performance comparisons: the abstract and reported findings describe qualitative patterns of human outperformance and heuristic differences but do not report sample sizes for LLM submissions in Tournament 2, the exact statistical tests or effect sizes used to establish superiority, or adjustments for multiple comparisons. Given the typically small number of distinct LLMs available, these omissions make it impossible to assess whether the observed differences are statistically reliable or sensitive to small-sample artifacts.
Authors: We acknowledge that greater statistical detail is necessary to substantiate the performance comparisons. In the revised manuscript, we will explicitly report the sample sizes for LLM submissions in Tournament 2, describe the exact statistical tests and effect sizes used to compare human and LLM performance, and indicate any adjustments made for multiple comparisons. These additions will help evaluate the reliability of the observed differences, particularly in light of the limited number of LLMs typically available. revision: yes
Circularity Check
Purely empirical tournament study with no derivation chain
The paper reports outcomes from three round-robin Colonel Blotto tournaments (one human, two LLM) and compares submitted strategies and win rates. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. All findings rest on direct empirical observation of allocation heuristics and performance gaps. No self-citation load-bearing steps, self-definitional constructs, or renamings of known results are present. The central claim that humans employ better-calibrated intermediate heuristics is a descriptive summary of the data, not a reduction to any input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the Colonel Blotto game has a high-dimensional action space and no pure-strategy Nash equilibria
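To give a sense of how large the action space is, the number of pure allocations can be counted with the stars-and-bars formula. The sketch below uses hypothetical function names and takes the 9 battlefields and 100 units from the participant instructions quoted on this page. Whether unused units are allowed is not clear from the truncated instructions, so both conventions are covered, and a brute-force check on a tiny instance guards the formula.

```python
import math
from itertools import product

def count_allocations(units, fields, must_spend_all=True):
    """Number of ways to split integer units across fields.

    must_spend_all=True counts allocations summing to exactly `units`;
    False counts allocations summing to at most `units`, which is the
    same as adding one slack field that absorbs the unspent units.
    """
    if must_spend_all:
        return math.comb(units + fields - 1, fields - 1)
    return math.comb(units + fields, fields)

def brute_force_count(units, fields, must_spend_all=True):
    """Enumerate every allocation; only feasible for tiny instances."""
    count = 0
    for alloc in product(range(units + 1), repeat=fields):
        total = sum(alloc)
        if total == units or (not must_spend_all and total < units):
            count += 1
    return count

# Tournament-sized instance: 100 units over 9 battlefields.
exact_budget = count_allocations(100, 9)
at_most_budget = count_allocations(100, 9, must_spend_all=False)
```

Under either convention the count is far too large to enumerate by hand, which is consistent with participants falling back on allocation heuristics.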
Reference graph
Works this paper leans on
-
[4]
Each state can be visited any integer number of times from 0 to 100
Each of the two candidates simultaneously and independently decides how many times and to which states to travel. Each state can be visited any integer number of times from 0 to 100
-
[5]
For winning a state, a candidate receives 1 electoral vote
In each state, the candidate who visited that state more times wins. For winning a state, a candidate receives 1 electoral vote. If both candidates visited a state the same number of times, the result in that state is a draw, and both receive 0.5 votes
-
[6]
The president is the candidate who receives more electoral votes. If both candidates receive the same number of votes, they toss a fair coin at the Central Election Commission, i.e., each becomes president with probability 0.5. Tournament 1. Please indicate, for each of the nine states A, B, C, D, E, F, G, H, I, how many trips you will make to that state. I...
-
[12]
The president is the candidate who receives more electoral votes. The winner receives 1 point. If the candidates receive the same number of electoral votes, each receives 0.5 points. You are one of the candidates. Please indicate, for each of the nine states A, B, C, D, E, F, G, H, I, how many trips you will make to that state. In total, you may make no more than 10...
-
[13]
Two candidates compete for the presidency of a fictional overseas country
-
[14]
The overseas country has 9 states: A, B, C, D, E, F, G, H, I
-
[15]
Each candidate has resources for 100 campaign trips
-
[16]
Each state may be visited any integer number of times from 0 to 100
Each of the two candidates simultaneously and independently decides how many times and to which states to travel. Each state may be visited any integer number of times from 0 to 100
-
[17]
For winning each of the 9 states, the candidate receives 1 electoral vote
In each state, the candidate who visited that state more times wins. For winning each of the 9 states, the candidate receives 1 electoral vote. If the candidates visited a given state the same number of times, the election in that state ends in a tie, and both players receive 0.5 electoral votes
-
[18]
The president is the candidate who receives more electoral votes. The winner receives 1 point. If the candidates receive the same number of electoral votes, each receives 0.5 points. You are one of the candidates. Please indicate, for each of the nine states A, B, C, D, E, F, G, H, I, how many trips you will make to that state. In total, you may make no more than 10...