Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

Arnaud Ricci

arxiv: 2606.24391 · v1 · pith:35XN2LAYnew · submitted 2026-06-23 · 💻 cs.AI · cs.CL· cs.GT· cs.MA

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

Arnaud Ricci This is my paper

Pith reviewed 2026-06-25 23:47 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.GTcs.MA

keywords LLM benchmarkfog of warstrategic reasoningnuclear strategydiplomacybelief trackingreliabilityadversarial games

0 comments

The pith

LLMs default to mechanical nuclear rushes in a fog-of-war 1v1 benchmark rather than diplomacy or conquest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Age of LLM as a turn-based 1v1 benchmark on a 13x7 grid where models must destroy an enemy base while handling fog of war, required diplomacy options, and strict JSON action rules. It finds that nuclear rushes dominate play at 78 percent on the rules-coherent sub-corpus, driven by the secret-simultaneous launch mechanics rather than any shortfall in deterrence reasoning. The work further shows that most illegal actions arise from fog or state errors, turning the error rate into a measure of belief tracking, while diplomacy occurs often but produces few agreements. These patterns appear across 54 matches with 15 models and position the turn-by-turn traces as a window into LLM reasoning under adversarial uncertainty.

Core claim

The author establishes that under secret-simultaneous nuclear launch rules in this benchmark, the nuclear rush with a sole launcher dominates play at 78 percent in the rules-coherent subset, and this pattern arises primarily from the mechanical incentives of the rules rather than a failure of models to grasp deterrence. Military conquest remains rare but ends games faster, diplomacy messages proliferate without consummation, and the illegal-action rate largely tracks belief errors from fog of war.

What carries the argument

The private Age of LLM game engine that enforces JSON action schemas, applies fog of war to hide map state, and permits diplomatic messages while keeping uranium secret.

If this is right

Nuclear rush occurs in 78 percent of rules-coherent matches and 85 percent corpus-wide.
Diplomacy messages appear frequently yet almost never produce agreements or ceasefires.
Roughly 58 percent of illegal actions stem from fog or state errors, so the illegal rate serves as a belief-tracking measure.
Military conquest, though uncommon, concludes matches in 12.3 turns versus 18.9 for other paths.
A weak association links higher action reliability to winning outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark's turn-by-turn traces could be mined to characterize distinct model personas or spontaneous deception patterns under uncertainty.
If the nuclear rush is largely mechanical, the same engine could test whether targeted prompting or fine-tuning alters strategy selection when launch rules change.
Future side-swapped or larger-scale runs would allow stronger claims about whether reliability differences predict wins across models.
The focus on belief-tracking errors suggests the setup could serve as a testbed for measuring improvements in LLM world modeling during multi-turn adversarial play.

Load-bearing premise

The assumption that a private engine together with fresh random map seeds and varied opponents fully prevents data contamination so that observed behaviors reflect genuine reasoning rather than memorized patterns.

What would settle it

Re-running the same models on publicly known map seeds or a public version of the engine and checking whether the nuclear-rush rate drops would test whether contamination drives the dominant strategy.

Figures

Figures reproduced from arXiv: 2606.24391 by Arnaud Ricci.

**Figure 2.** Figure 2: (a) Victory channels across 54 completed matches. The nuclear rush accounts for 78% of [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Launch timing across the corpus. All 46 model-issued launches fall in turns 16–23, clustering [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: The public live leaderboard at https://ageofllm.org, ranked by points per match. The frozen 54-match snapshot analysed in this paper ( [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Points per match versus illegal-action rate. The negative correlation ( [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Diplomacy funnel (log scale). Free messages dominate traffic (1,804) but carry no binding effect. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: (a) Action budget across the corpus (5,258 actions). Movement dominates (43%); launches are [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Illegal-action rejection reasons (295 total). Red bars are fog/state errors (field-of-view and [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: (a) Opening courtesy rate by model. Courtesy phrases appear in up to 63% of Gemini Flash [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Message tone while performing a significant action. Green = the message truthfully announces [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

read the original abstract

We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection; see Section 2.7). We benchmark 15 reasoning models across 54 matches and 5,258 actions. Findings: (1) the nuclear rush dominates (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature that is largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure; (2) military conquest is rare but faster (12.3 vs 18.9 turns); (3) diplomacy is prolific yet almost never consummated; (4) ~58% of illegal actions are fog/state errors, making the illegal-action rate a measure of belief-tracking; (5) -- the least established, and the only one we label exploratory -- a weak link associates reliability with winning. The corpus is small, unbalanced and not side-swapped, so the ranking is a preliminary descriptive view, not a contribution. Beyond ranking, the turn-by-turn traces of actions and messages make the corpus a lens on how LLMs reason under adversarial uncertainty -- their belief-tracking, spontaneous deception, and per-model cognitive "personas" -- which we frame as a future research direction. We release the replay format, an isometric viewer and all replays; engine source on request.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New benchmark setup with fog, secret diplomacy and JSON enforcement is the real addition, but 54 matches and untested contamination controls make the nuclear-rush and reliability claims too thin to rely on.

read the letter

The paper's actual novelty is the benchmark itself: a 1v1 grid game that forces fog of war, secret uranium in diplomacy, and mandatory JSON actions with silent discards. That specific mix of stressors is not in the cited prior work, and releasing the replay format plus viewer is a practical step that lets others inspect traces directly.

The authors are straightforward about the limits. They label the reliability-win link exploratory, note the corpus is small and unbalanced, and avoid claiming the ranking is definitive. The turn-by-turn data on belief errors and unconsummated diplomacy could be useful raw material for people studying how models track state or attempt deception.

The soft spots sit where the claims get more specific. The 78% nuclear-rush rate is presented as largely mechanical from the secret-simultaneous rules, yet the only safeguard against contamination is a private engine plus fresh seeds. No ablation shows what happens when the rule text is altered, so it remains possible the models are executing patterns triggered by the prompt rather than reasoning from the mechanics. With only 54 matches and no side-swapping, any pattern is still fragile. The illegal-action breakdown as a belief-tracking signal is reasonable but inherits the same sample-size issue.

This is for researchers building or evaluating LLM game agents who want a new testbed with built-in uncertainty and output constraints. It is worth sending to peer review because the benchmark design is concrete and the released artifacts lower the barrier for follow-up work, even though the current findings need more matches and validation experiments before they can be treated as stable.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Age of LLM benchmark, a turn-based 1v1 game on a 13x7 grid pitting LLMs against each other under fog of war, full diplomacy (including secret uranium and messages/ceasefires/ultimatums), and a reliability constraint requiring strict JSON action schemas with silent discards for illegal moves. From 54 matches across 15 models yielding 5,258 actions, it reports that nuclear rushes dominate (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature attributed to secret-simultaneous launch rules rather than cognitive deterrence failure; military conquest is rare but faster (12.3 vs 18.9 turns); diplomacy is prolific yet rarely consummated; ~58% of illegal actions are fog/state errors (positioning illegal-action rate as a belief-tracking measure); and there is a weak exploratory association between reliability and winning. The corpus is explicitly described as small, unbalanced, and not side-swapped, so rankings are preliminary; the turn-by-turn traces are framed as a resource for studying LLM reasoning, belief-tracking, and deception under uncertainty.

Significance. If the nuclear-rush signature can be isolated as rule-driven rather than arising from pre-trained patterns, the benchmark would provide a useful descriptive lens on LLM behavior in adversarial settings with incomplete information, including spontaneous strategies and per-model personas. The release of the replay format, isometric viewer, and all matches supports reproducibility and follow-on analysis, though the acknowledged small scale and exploratory framing limit claims about model rankings or general reliability.

major comments (2)

[Abstract] Abstract (paragraph on engine and seeds): The assertion that the private engine together with fresh random map seeds 'mitigating the data contamination that affects public benchmarks' is presented without supporting tests, such as whether the 78% nuclear-rush rate and sole-launcher signature persist or disappear under prompts with structurally altered simultaneous-launch or secret-uranium rules. This assumption is load-bearing for the central claim that the behavior is 'largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure'.
[Abstract / Findings (1)] Findings (1) and corpus description: The 78% nuclear-rush rate is reported specifically on the 'rules-coherent v0.11+ sub-corpus', yet the manuscript provides no details on the size of this sub-corpus, the exact criteria for 'rules-coherent', or sensitivity checks (e.g., excluding early versions), which is necessary to evaluate robustness given the overall corpus of only 54 matches that is described as small and unbalanced.

minor comments (1)

[Abstract] Abstract: The parenthetical note that 'two tactical seed phrases were present during data collection; see Section 2.7' creates minor ambiguity about the 'near rule-only prompt' description; clarifying the content and impact of these phrases in the main text would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas for improving the clarity and robustness of our presentation. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (paragraph on engine and seeds): The assertion that the private engine together with fresh random map seeds 'mitigating the data contamination that affects public benchmarks' is presented without supporting tests, such as whether the 78% nuclear-rush rate and sole-launcher signature persist or disappear under prompts with structurally altered simultaneous-launch or secret-uranium rules. This assumption is load-bearing for the central claim that the behavior is 'largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure'.

Authors: The mitigation of data contamination is indeed presented as a consequence of the private engine and fresh seeds without direct empirical tests on rule variants. However, the central claim that the sole-launcher signature is largely mechanical stems from the specific game mechanics (secret uranium placement and simultaneous launch rules), which we analyze in the manuscript. We acknowledge that testing with altered rules would strengthen this interpretation. In the revision, we will clarify this distinction in the abstract and add a sentence noting that while the design mitigates contamination, the mechanical interpretation relies on rule analysis, and variant testing remains an open direction for future work. revision: partial
Referee: [Abstract / Findings (1)] Findings (1) and corpus description: The 78% nuclear-rush rate is reported specifically on the 'rules-coherent v0.11+ sub-corpus', yet the manuscript provides no details on the size of this sub-corpus, the exact criteria for 'rules-coherent', or sensitivity checks (e.g., excluding early versions), which is necessary to evaluate robustness given the overall corpus of only 54 matches that is described as small and unbalanced.

Authors: We agree that additional details on the sub-corpus are needed for proper evaluation. The v0.11+ sub-corpus consists of matches run with the stabilized rule set starting from version 0.11, which fixed certain implementation issues present in earlier versions. We will include in the revised manuscript the exact size of this sub-corpus (out of the 54 total matches), the precise criteria used to define 'rules-coherent', and note any sensitivity checks or the lack thereof due to the exploratory nature of the study. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical counts from new matches on private engine

full rationale

The paper contains no derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. All reported statistics (e.g., 78% nuclear-rush rate) are direct tallies from 54 fresh matches run on a private engine with random seeds. The methodological claim that the private engine plus seeds mitigates contamination is an unverified assumption but does not constitute a self-referential derivation or load-bearing self-citation chain. No uniqueness theorems, ansatzes, or renamings of known results appear. The work is therefore self-contained as a descriptive benchmark report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical model or derivation; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5899 in / 1124 out tokens · 33828 ms · 2026-06-25T23:47:29.508660+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 1 canonical work pages

[1]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021

2021
[2]

M. Chen, J. Tworek, H. Jun, Q. Yuan,et al.Evaluating large language models trained on code. arXiv preprintarXiv:2107.03374, 2021

Pith/arXiv arXiv 2021
[3]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. InICLR, 2021

2021
[4]

Liang, R

P. Liang, R. Bommasani, T. Lee,et al.Holistic evaluation of language models.Transactions on Machine Learning Research (TMLR), 2023. arXiv:2211.09110

Pith/arXiv arXiv 2023
[5]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InICLR, 2024. arXiv:2310.06770

Pith/arXiv arXiv 2024
[6]

Silver, A

D. Silver, A. Huang, C. J. Maddison,et al.Mastering the game of Go with deep neural networks and tree search.Nature, 529(7587):484–489, 2016

2016
[7]

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom

A. Bakhtin, N. Brown, E. Dinan,et al.Human-level play in the game of Diplomacy by combining language models with strategic reasoning.Science, 378(6624):1067–1074, 2022. doi:10.1126/science.ade9097

work page doi:10.1126/science.ade9097 2022
[8]

X. Liu, H. Yu, H. Zhang,et al.AgentBench: Evaluating LLMs as agents. InICLR, 2024. arXiv:2308.03688. 20

Pith/arXiv arXiv 2024
[9]

Costarelli, M

A. Costarelli, M. Allen, R. Hauksson,et al.GameBench: Evaluating strategic reasoning abilities of LLM agents.arXiv preprintarXiv:2406.06613, 2024

arXiv 2024
[10]

J. Duan, R. Zhang, J. Diffenderfer,et al.GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprintarXiv:2402.12348, 2024

arXiv 2024
[11]

de Wynter

A. de Wynter. If LLMs have human-like attributes, then so does Age of Empires II.arXiv preprint arXiv:2605.31514, 2026

Pith/arXiv arXiv 2026
[12]

max_turns

J. Chen, S. Cheng, C. Gurkan, and H. M. A. Fattah. To nuke or not to nuke: LLMs’ (missing) ethical reasoning and actions in a high-stakes decision-making simulation.arXiv preprintarXiv:2606.08310, 2026. A System prompt (verbatim) The complete, unedited system prompt given to every model is reproduced below. It describes only the rules and the JSON action ...

Pith/arXiv arXiv 2026
[13]

actions": [...],

Raw JSON (preferred): {"actions": [...], "message": "..."}
[14]

Markdown fence: ‘‘‘json\\n{...}\\n‘‘‘
[15]

actions": [ {

Tagged block: <json>{...}</json> Response structure: { "actions": [ {"type": "produce", "unit": "drone"}, {"type": "move", "unit": "A_tank_1", "to": [4, 3]}, {"type": "build", "target": "credit_mine", "pos": [2, 1]} ], "diplomatic_proposal": null, "diplomatic_responses": [], "message": "Short 1-2 sentence message visible to the opponent next turn." } Maxi...
[16]

You launch the bomb and the opponent does NOT launch the same turn -> YOUR VICTORY (nuclear)
[17]

You bring the enemy base HP to 0 with tanks -> YOUR VICTORY (military)
[18]

Opponent accepts your ultimatum -> YOUR VICTORY (ultimatum)
[19]

Peace accepted -> DRAW
[20]

Both players launch on the SAME turn -> MUTUAL DESTRUCTION (both lose, 0 pts)
[21]

in flight

Turn limit reached with no outcome -> DRAW NOTE: if you ACCEPT an opponent’s ultimatum you LOSE but are awarded 0.5 consolation points (better than a 0-point defeat). So accepting a hopeless position is rewarded over fighting on. === SIMULTANEOUS NUCLEAR LAUNCH (engine-resolved) === Launches are SIMULTANEOUS and resolved by the engine at the END of the tu...

[1] [1]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021

2021

[2] [2]

M. Chen, J. Tworek, H. Jun, Q. Yuan,et al.Evaluating large language models trained on code. arXiv preprintarXiv:2107.03374, 2021

Pith/arXiv arXiv 2021

[3] [3]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. InICLR, 2021

2021

[4] [4]

Liang, R

P. Liang, R. Bommasani, T. Lee,et al.Holistic evaluation of language models.Transactions on Machine Learning Research (TMLR), 2023. arXiv:2211.09110

Pith/arXiv arXiv 2023

[5] [5]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InICLR, 2024. arXiv:2310.06770

Pith/arXiv arXiv 2024

[6] [6]

Silver, A

D. Silver, A. Huang, C. J. Maddison,et al.Mastering the game of Go with deep neural networks and tree search.Nature, 529(7587):484–489, 2016

2016

[7] [7]

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom

A. Bakhtin, N. Brown, E. Dinan,et al.Human-level play in the game of Diplomacy by combining language models with strategic reasoning.Science, 378(6624):1067–1074, 2022. doi:10.1126/science.ade9097

work page doi:10.1126/science.ade9097 2022

[8] [8]

X. Liu, H. Yu, H. Zhang,et al.AgentBench: Evaluating LLMs as agents. InICLR, 2024. arXiv:2308.03688. 20

Pith/arXiv arXiv 2024

[9] [9]

Costarelli, M

A. Costarelli, M. Allen, R. Hauksson,et al.GameBench: Evaluating strategic reasoning abilities of LLM agents.arXiv preprintarXiv:2406.06613, 2024

arXiv 2024

[10] [10]

J. Duan, R. Zhang, J. Diffenderfer,et al.GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprintarXiv:2402.12348, 2024

arXiv 2024

[11] [11]

de Wynter

A. de Wynter. If LLMs have human-like attributes, then so does Age of Empires II.arXiv preprint arXiv:2605.31514, 2026

Pith/arXiv arXiv 2026

[12] [12]

max_turns

J. Chen, S. Cheng, C. Gurkan, and H. M. A. Fattah. To nuke or not to nuke: LLMs’ (missing) ethical reasoning and actions in a high-stakes decision-making simulation.arXiv preprintarXiv:2606.08310, 2026. A System prompt (verbatim) The complete, unedited system prompt given to every model is reproduced below. It describes only the rules and the JSON action ...

Pith/arXiv arXiv 2026

[13] [13]

actions": [...],

Raw JSON (preferred): {"actions": [...], "message": "..."}

[14] [14]

Markdown fence: ‘‘‘json\\n{...}\\n‘‘‘

[15] [15]

actions": [ {

Tagged block: <json>{...}</json> Response structure: { "actions": [ {"type": "produce", "unit": "drone"}, {"type": "move", "unit": "A_tank_1", "to": [4, 3]}, {"type": "build", "target": "credit_mine", "pos": [2, 1]} ], "diplomatic_proposal": null, "diplomatic_responses": [], "message": "Short 1-2 sentence message visible to the opponent next turn." } Maxi...

[16] [16]

You launch the bomb and the opponent does NOT launch the same turn -> YOUR VICTORY (nuclear)

[17] [17]

You bring the enemy base HP to 0 with tanks -> YOUR VICTORY (military)

[18] [18]

Opponent accepts your ultimatum -> YOUR VICTORY (ultimatum)

[19] [19]

Peace accepted -> DRAW

[20] [20]

Both players launch on the SAME turn -> MUTUAL DESTRUCTION (both lose, 0 pts)

[21] [21]

in flight

Turn limit reached with no outcome -> DRAW NOTE: if you ACCEPT an opponent’s ultimatum you LOSE but are awarded 0.5 consolation points (better than a 0-point defeat). So accepting a hopeless position is rewarded over fighting on. === SIMULTANEOUS NUCLEAR LAUNCH (engine-resolved) === Launches are SIMULTANEOUS and resolved by the engine at the END of the tu...