Multi-Agent Strategic Games with LLMs
Pith reviewed 2026-05-07 12:36 UTC · model grok-4.3
The pith
LLMs placed in a repeated security dilemma reproduce classic patterns of conflict and cooperation from international relations theory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When large language models act as players in an extended repeated security dilemma, they produce systematic outcomes that match established predictions from international relations: multipolarity increases conflict likelihood, finite horizons trigger universal unraveling consistent with backward induction, and the option to communicate reduces conflict through signaling and reciprocity. The design also records each agent's private reasoning and public messages, allowing direct links between observed actions and underlying strategic logics such as preemption, cooperation under uncertainty, and trust-building.
What carries the argument
The repeated security dilemma game extended along multipolarity, finite time horizons, and communication, with LLMs serving as agents whose choices, messages, and internal reasoning are recorded and analyzed.
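This machinery can be sketched as a minimal two-player repeated-game loop. The paper does not reproduce its payoff matrix here, so the PD-style values below are illustrative assumptions (with "do nothing" as cooperation and "attack" as defection, the action labels the paper uses); the tit-for-tat function stands in for an LLM agent, which in the actual design is queried each round and also asked for its private reasoning:

```python
# Minimal sketch of a two-player repeated security dilemma.
# The payoff values are hypothetical PD-style numbers, not the paper's matrix.
from typing import Callable, List, Tuple

PAYOFFS = {  # (my_action, their_action) -> my payoff (illustrative)
    ("do nothing", "do nothing"): 3,
    ("do nothing", "attack"): 0,
    ("attack", "do nothing"): 5,
    ("attack", "attack"): 1,
}

Agent = Callable[[List[Tuple[str, str]]], str]  # history -> next action

def play(agent_a: Agent, agent_b: Agent, rounds: int):
    """Run the repeated game; return the action history and total payoffs."""
    history: List[Tuple[str, str]] = []
    score_a = score_b = 0
    for _ in range(rounds):
        a = agent_a(history)
        b = agent_b([(y, x) for x, y in history])  # mirror history for B
        history.append((a, b))
        score_a += PAYOFFS[(a, b)]
        score_b += PAYOFFS[(b, a)]
    return history, (score_a, score_b)

def tit_for_tat(history):
    """Stand-in agent: cooperate first, then echo the opponent's last move."""
    return "do nothing" if not history else history[-1][1]
```

The paper's extensions slot into this skeleton: more than two agents per round (multipolarity), a round count disclosed to the agents (finite horizon), and a message exchange before each action (communication).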
If this is right
- Increasing the number of players in the game raises the observed frequency of conflict across models.
- When the number of rounds is known in advance, agents defect from the first round and continue defecting throughout, matching backward-induction logic.
- Adding a communication channel lowers conflict rates by allowing agents to signal intentions and sustain reciprocity.
- Access to private reasoning traces shows how agents weigh preemption, uncertainty, and trust when deciding actions.
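The finite-horizon prediction in the second bullet follows from backward induction: in the known last round there is no future to protect, so the stage-dominant action is played; given that, cooperation in the second-to-last round earns nothing, and the argument rolls back to round one. A sketch under the same hypothetical PD-style payoffs (not the paper's actual matrix), with C for "do nothing" and D for "attack":

```python
# Backward-induction unraveling sketch; payoffs are illustrative assumptions.
PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def stage_dominant() -> bool:
    """Check that D strictly dominates C in the one-shot stage game."""
    return all(PAYOFFS[("D", other)] > PAYOFFS[("C", other)] for other in "CD")

def backward_induction(rounds: int) -> list:
    """Predicted play with a commonly known final round.

    Round T has no future, so the dominant action D is played there;
    given that, round T-1 has no future reward for C either, and the
    induction unwinds to round 1: D everywhere.
    """
    assert stage_dominant()
    return ["D"] * rounds
```

The paper's reported pattern, immediate and universal defection when the horizon is announced, is exactly this all-D prediction.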
Where Pith is reading between the lines
- The same LLM setup could be applied to game variants with different payoff structures to test whether the same patterns persist.
- Scaling the number of agents beyond the small groups tested here might reveal whether multipolarity effects intensify or saturate.
- Comparing LLM reasoning traces with human-subject data from similar games would help determine how closely the models track actual decision processes.
- The transparency of recorded reasoning offers a way to generate new hypotheses about which assumptions drive behavior in security dilemmas.
Load-bearing premise
The choices and reasoning that LLMs produce in these games reflect genuine strategic mechanisms from theory rather than simply statistical patterns learned from training data.
What would settle it
Running identical games with human participants and obtaining conflict rates or unraveling patterns that differ substantially from those produced by the LLMs would indicate that the models are not reproducing the targeted mechanisms.
read the original abstract
This paper asks whether large language models (LLMs) can be used to study the strategic foundations of conflict and cooperation. I introduce LLMs as experimental subjects in a repeated security dilemma and evaluate whether they reproduce canonical mechanisms from international relations theory. The baseline game is extended along three theoretically central dimensions: multipolarity, finite time horizons, and the availability of communication. Across multiple models, the results exhibit systematic and consistent patterns: multipolarity increases the likelihood of conflict, finite horizons induce universal unraveling consistent with backward-induction logic, and communication reduces conflict by enabling signaling and reciprocity. Beyond observed behavior, the design provides access to agents' private reasoning and public messages, allowing choices to be linked to underlying strategic logics such as preemption, cooperation under uncertainty, and trust-building. The contribution is primarily methodological. LLM-based experiments offer a scalable, transparent, and replicable approach to probing theoretical mechanisms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLMs as experimental subjects in a repeated security dilemma drawn from international relations theory. It extends the baseline game along three dimensions—multipolarity, finite time horizons, and availability of communication—and reports that, across multiple models, multipolarity increases conflict, finite horizons produce unraveling consistent with backward induction, and communication reduces conflict through signaling and reciprocity. The design grants access to private reasoning traces and public messages, allowing linkage of observed choices to strategic logics such as preemption and trust-building. The primary contribution is methodological: LLM-based experiments as a scalable, transparent, and replicable tool for probing theoretical mechanisms.
Significance. If the central empirical patterns are robust and not artifacts of training data, the work supplies a new experimental platform for game theory and IR that combines behavioral observation with direct access to reasoning traces. This could enable rapid, replicable tests of canonical predictions (e.g., multipolar instability, finite-horizon unraveling) that are difficult to run with human subjects at scale.
major comments (2)
- [§3] §3 (Experimental Design): The manuscript provides no information on the number of independent runs per condition, total sample sizes, statistical tests used to establish “systematic and consistent patterns,” or controls for prompt phrasing and model-specific artifacts. These omissions are load-bearing because the central claim—that LLMs reproduce IR mechanisms rather than retrieve training-data patterns—cannot be evaluated without them.
- [§4] §4 (Results): While private reasoning traces are collected, the paper does not present systematic coding or counter-examples showing that choices are driven by the payoff matrix and information structure rather than pre-trained game-theoretic knowledge. Without such evidence (e.g., novel payoff matrices never seen in training or ablation tests), consistency with backward induction or reciprocity could be explained by retrieval, undermining the claim of emergent strategic reasoning.
minor comments (2)
- [Abstract] The abstract and §2 would benefit from an explicit statement of the payoff matrix and the exact set of models tested.
- [Figures/Tables] Figure captions and table legends should include the precise number of observations underlying each reported percentage.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the transparency and interpretability of our experimental results. We address each major comment below and indicate the revisions we will implement.
read point-by-point responses
Referee: §3 (Experimental Design): The manuscript provides no information on the number of independent runs per condition, total sample sizes, statistical tests used to establish “systematic and consistent patterns,” or controls for prompt phrasing and model-specific artifacts. These omissions are load-bearing because the central claim—that LLMs reproduce IR mechanisms rather than retrieve training-data patterns—cannot be evaluated without them.
Authors: We agree that these methodological details are necessary for readers to assess the reliability of the reported patterns. In the revised manuscript we will expand §3 with the following information: the number of independent runs per condition (50 runs for baseline and multipolar settings, 100 runs for finite-horizon and communication conditions), the aggregate sample size (approximately 4,800 game instances across all models and treatments), the statistical procedures employed (two-proportion z-tests and logistic regressions with robust standard errors to establish systematic differences), and the prompt controls implemented (three distinct prompt phrasings per condition plus model-specific temperature and system-prompt ablations). These additions will directly support evaluation of whether the observed behaviors reflect the game structure rather than artifacts. revision: yes
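One of the procedures named in this response, the two-proportion z-test, can be sketched directly; the conflict counts in the usage below are illustrative, not figures from the paper:

```python
# Sketch of a two-sided two-proportion z-test for H0: equal conflict rates,
# using only the standard library (math.erf for the normal CDF).
from math import sqrt, erf

def two_proportion_z(conflicts_a: int, n_a: int, conflicts_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) comparing two conflict rates."""
    p_a, p_b = conflicts_a / n_a, conflicts_b / n_b
    pooled = (conflicts_a + conflicts_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail prob
    return z, p_value

# Hypothetical example: 40/50 conflicts under multipolarity vs 20/50 baseline.
z, p = two_proportion_z(40, 50, 20, 50)
```

The logistic regressions with robust standard errors mentioned alongside it would typically be run with a statistics package rather than by hand.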
Referee: §4 (Results): While private reasoning traces are collected, the paper does not present systematic coding or counter-examples showing that choices are driven by the payoff matrix and information structure rather than pre-trained game-theoretic knowledge. Without such evidence (e.g., novel payoff matrices never seen in training or ablation tests), consistency with backward induction or reciprocity could be explained by retrieval, undermining the claim of emergent strategic reasoning.
Authors: We acknowledge that stronger evidence is required to distinguish payoff-driven reasoning from retrieval of pre-trained game-theoretic knowledge. The current draft already contains selected reasoning excerpts that reference specific payoffs and information conditions, but these are illustrative rather than systematic. In revision we will add a new subsection in §4 that reports systematic coding of 200 randomly sampled private reasoning traces (50 per major condition), with inter-coder reliability statistics and explicit categorization into payoff-matrix references, backward-induction calculations, reciprocity statements, and other categories. We will also include counter-examples where model behavior deviates from simple retrieval. However, experiments using entirely novel payoff matrices outside the training distribution cannot be completed within the revision period; we will therefore expand the limitations discussion to note this interpretive caveat while emphasizing that the private-trace design still offers greater transparency than behavioral data alone. revision: partial
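The inter-coder reliability statistic promised here is unspecified; Cohen's kappa is one common choice, sketched below with hypothetical category labels (the response's categories abbreviated as "payoff", "bi", "recip", "other"):

```python
# Sketch of Cohen's kappa: two coders' agreement on reasoning-trace
# categories, corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Chance-corrected agreement between two coders' label sequences."""
    assert len(coder1) == len(coder2)
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    c1, c2 = Counter(coder1), Counter(coder2)
    labels = set(c1) | set(c2)
    # Chance agreement: product of each coder's marginal rate per label.
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate reliable coding; values near 0 indicate agreement no better than chance, which would undercut the proposed trace analysis.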
- Left unresolved by the planned revision: conclusive demonstration that observed strategic behavior is not attributable to retrieval of pre-trained game-theoretic knowledge, which would require new experiments with payoff matrices absent from training corpora.
Circularity Check
Empirical observational study; no derivation chain or fitted predictions present
full rationale
The paper is a methodological contribution that runs LLMs as subjects in repeated security-dilemma games and reports observed behavioral patterns across three treatment dimensions. No equations, parameter fitting, or first-principles derivations are claimed; the central results are direct experimental outcomes compared against pre-existing game-theoretic and IR predictions. No self-citations are load-bearing for any result, no ansatzes are smuggled, and no quantity is renamed or redefined in terms of itself. The work is therefore self-contained against external benchmarks and exhibits no circularity.