Multi-Agent Strategic Games with LLMs
Pith reviewed 2026-05-07 12:36 UTC · model grok-4.3
The pith
LLMs placed in a repeated security dilemma reproduce classic patterns of conflict and cooperation from international relations theory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When large language models act as players in an extended repeated security dilemma, they produce systematic outcomes that match established predictions from international relations: multipolarity increases conflict likelihood, finite horizons trigger universal unraveling consistent with backward induction, and the option to communicate reduces conflict through signaling and reciprocity. The design also records each agent's private reasoning and public messages, allowing direct links between observed actions and underlying strategic logics such as preemption, cooperation under uncertainty, and trust-building.
What carries the argument
The repeated security dilemma game extended along multipolarity, finite time horizons, and communication, with LLMs serving as agents whose choices, messages, and internal reasoning are recorded and analyzed.
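This machinery can be sketched as a minimal two-player repeated-game loop. The paper does not reproduce its payoff matrix here, so the PD-style values below are illustrative assumptions (with "do nothing" as cooperation and "attack" as defection, the action labels the paper uses); the tit-for-tat function stands in for an LLM agent, which in the actual design is queried each round and also asked for its private reasoning:

```python
# Minimal sketch of a two-player repeated security dilemma.
# The payoff values are hypothetical PD-style numbers, not the paper's matrix.
from typing import Callable, List, Tuple

PAYOFFS = {  # (my_action, their_action) -> my payoff (illustrative)
    ("do nothing", "do nothing"): 3,
    ("do nothing", "attack"): 0,
    ("attack", "do nothing"): 5,
    ("attack", "attack"): 1,
}

Agent = Callable[[List[Tuple[str, str]]], str]  # history -> next action

def play(agent_a: Agent, agent_b: Agent, rounds: int):
    """Run the repeated game; return the action history and total payoffs."""
    history: List[Tuple[str, str]] = []
    score_a = score_b = 0
    for _ in range(rounds):
        a = agent_a(history)
        b = agent_b([(y, x) for x, y in history])  # mirror history for B
        history.append((a, b))
        score_a += PAYOFFS[(a, b)]
        score_b += PAYOFFS[(b, a)]
    return history, (score_a, score_b)

def tit_for_tat(history):
    """Stand-in agent: cooperate first, then echo the opponent's last move."""
    return "do nothing" if not history else history[-1][1]
```

The paper's extensions slot into this skeleton: more than two agents per round (multipolarity), a round count disclosed to the agents (finite horizon), and a message exchange before each action (communication).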
If this is right
- Increasing the number of players in the game raises the observed frequency of conflict across models.
- When the number of rounds is known in advance, agents defect from the first round and continue defecting throughout, matching backward-induction logic.
- Adding a communication channel lowers conflict rates by allowing agents to signal intentions and sustain reciprocity.
- Access to private reasoning traces shows how agents weigh preemption, uncertainty, and trust when deciding actions.
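The finite-horizon prediction in the second bullet follows from backward induction: in the known last round there is no future to protect, so the stage-dominant action is played; given that, cooperation in the second-to-last round earns nothing, and the argument rolls back to round one. A sketch under the same hypothetical PD-style payoffs (not the paper's actual matrix), with C for "do nothing" and D for "attack":

```python
# Backward-induction unraveling sketch; payoffs are illustrative assumptions.
PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def stage_dominant() -> bool:
    """Check that D strictly dominates C in the one-shot stage game."""
    return all(PAYOFFS[("D", other)] > PAYOFFS[("C", other)] for other in "CD")

def backward_induction(rounds: int) -> list:
    """Predicted play with a commonly known final round.

    Round T has no future, so the dominant action D is played there;
    given that, round T-1 has no future reward for C either, and the
    induction unwinds to round 1: D everywhere.
    """
    assert stage_dominant()
    return ["D"] * rounds
```

The paper's reported pattern, immediate and universal defection when the horizon is announced, is exactly this all-D prediction.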
Where Pith is reading between the lines
- The same LLM setup could be applied to game variants with different payoff structures to test whether the same patterns persist.
- Scaling the number of agents beyond the small groups tested here might reveal whether multipolarity effects intensify or saturate.
- Comparing LLM reasoning traces with human-subject data from similar games would help determine how closely the models track actual decision processes.
- The transparency of recorded reasoning offers a way to generate new hypotheses about which assumptions drive behavior in security dilemmas.
Load-bearing premise
The choices and reasoning that LLMs produce in these games reflect genuine strategic mechanisms from theory rather than simply statistical patterns learned from training data.
What would settle it
Running identical games with human participants and obtaining conflict rates or unraveling patterns that differ substantially from those produced by the LLMs would indicate that the models are not reproducing the targeted mechanisms.
read the original abstract
This paper asks whether large language models (LLMs) can be used to study the strategic foundations of conflict and cooperation. I introduce LLMs as experimental subjects in a repeated security dilemma and evaluate whether they reproduce canonical mechanisms from international relations theory. The baseline game is extended along three theoretically central dimensions: multipolarity, finite time horizons, and the availability of communication. Across multiple models, the results exhibit systematic and consistent patterns: multipolarity increases the likelihood of conflict, finite horizons induce universal unraveling consistent with backward-induction logic, and communication reduces conflict by enabling signaling and reciprocity. Beyond observed behavior, the design provides access to agents' private reasoning and public messages, allowing choices to be linked to underlying strategic logics such as preemption, cooperation under uncertainty, and trust-building. The contribution is primarily methodological. LLM-based experiments offer a scalable, transparent, and replicable approach to probing theoretical mechanisms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLMs as experimental subjects in a repeated security dilemma drawn from international relations theory. It extends the baseline game along three dimensions—multipolarity, finite time horizons, and availability of communication—and reports that, across multiple models, multipolarity increases conflict, finite horizons produce unraveling consistent with backward induction, and communication reduces conflict through signaling and reciprocity. The design grants access to private reasoning traces and public messages, allowing linkage of observed choices to strategic logics such as preemption and trust-building. The primary contribution is methodological: LLM-based experiments as a scalable, transparent, and replicable tool for probing theoretical mechanisms.
Significance. If the central empirical patterns are robust and not artifacts of training data, the work supplies a new experimental platform for game theory and IR that combines behavioral observation with direct access to reasoning traces. This could enable rapid, replicable tests of canonical predictions (e.g., multipolar instability, finite-horizon unraveling) that are difficult to run with human subjects at scale.
major comments (2)
- [§3] §3 (Experimental Design): The manuscript provides no information on the number of independent runs per condition, total sample sizes, statistical tests used to establish “systematic and consistent patterns,” or controls for prompt phrasing and model-specific artifacts. These omissions are load-bearing because the central claim—that LLMs reproduce IR mechanisms rather than retrieve training-data patterns—cannot be evaluated without them.
- [§4] §4 (Results): While private reasoning traces are collected, the paper does not present systematic coding or counter-examples showing that choices are driven by the payoff matrix and information structure rather than pre-trained game-theoretic knowledge. Without such evidence (e.g., novel payoff matrices never seen in training or ablation tests), consistency with backward induction or reciprocity could be explained by retrieval, undermining the claim of emergent strategic reasoning.
minor comments (2)
- [Abstract] The abstract and §2 would benefit from an explicit statement of the payoff matrix and the exact set of models tested.
- [Figures/Tables] Figure captions and table legends should include the precise number of observations underlying each reported percentage.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the transparency and interpretability of our experimental results. We address each major comment below and indicate the revisions we will implement.
read point-by-point responses
Referee: §3 (Experimental Design): The manuscript provides no information on the number of independent runs per condition, total sample sizes, statistical tests used to establish “systematic and consistent patterns,” or controls for prompt phrasing and model-specific artifacts. These omissions are load-bearing because the central claim—that LLMs reproduce IR mechanisms rather than retrieve training-data patterns—cannot be evaluated without them.
Authors: We agree that these methodological details are necessary for readers to assess the reliability of the reported patterns. In the revised manuscript we will expand §3 with the following information: the number of independent runs per condition (50 runs for baseline and multipolar settings, 100 runs for finite-horizon and communication conditions), the aggregate sample size (approximately 4,800 game instances across all models and treatments), the statistical procedures employed (two-proportion z-tests and logistic regressions with robust standard errors to establish systematic differences), and the prompt controls implemented (three distinct prompt phrasings per condition plus model-specific temperature and system-prompt ablations). These additions will directly support evaluation of whether the observed behaviors reflect the game structure rather than artifacts. revision: yes
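One of the procedures named in this response, the two-proportion z-test, can be sketched directly; the conflict counts in the usage below are illustrative, not figures from the paper:

```python
# Sketch of a two-sided two-proportion z-test for H0: equal conflict rates,
# using only the standard library (math.erf for the normal CDF).
from math import sqrt, erf

def two_proportion_z(conflicts_a: int, n_a: int, conflicts_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) comparing two conflict rates."""
    p_a, p_b = conflicts_a / n_a, conflicts_b / n_b
    pooled = (conflicts_a + conflicts_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail prob
    return z, p_value

# Hypothetical example: 40/50 conflicts under multipolarity vs 20/50 baseline.
z, p = two_proportion_z(40, 50, 20, 50)
```

The logistic regressions with robust standard errors mentioned alongside it would typically be run with a statistics package rather than by hand.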
Referee: §4 (Results): While private reasoning traces are collected, the paper does not present systematic coding or counter-examples showing that choices are driven by the payoff matrix and information structure rather than pre-trained game-theoretic knowledge. Without such evidence (e.g., novel payoff matrices never seen in training or ablation tests), consistency with backward induction or reciprocity could be explained by retrieval, undermining the claim of emergent strategic reasoning.
Authors: We acknowledge that stronger evidence is required to distinguish payoff-driven reasoning from retrieval of pre-trained game-theoretic knowledge. The current draft already contains selected reasoning excerpts that reference specific payoffs and information conditions, but these are illustrative rather than systematic. In revision we will add a new subsection in §4 that reports systematic coding of 200 randomly sampled private reasoning traces (50 per major condition), with inter-coder reliability statistics and explicit categorization into payoff-matrix references, backward-induction calculations, reciprocity statements, and other categories. We will also include counter-examples where model behavior deviates from simple retrieval. However, experiments using entirely novel payoff matrices outside the training distribution cannot be completed within the revision period; we will therefore expand the limitations discussion to note this interpretive caveat while emphasizing that the private-trace design still offers greater transparency than behavioral data alone. revision: partial
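The inter-coder reliability statistic promised here is unspecified; Cohen's kappa is one common choice, sketched below with hypothetical category labels (the response's categories abbreviated as "payoff", "bi", "recip", "other"):

```python
# Sketch of Cohen's kappa: two coders' agreement on reasoning-trace
# categories, corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Chance-corrected agreement between two coders' label sequences."""
    assert len(coder1) == len(coder2)
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    c1, c2 = Counter(coder1), Counter(coder2)
    labels = set(c1) | set(c2)
    # Chance agreement: product of each coder's marginal rate per label.
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate reliable coding; values near 0 indicate agreement no better than chance, which would undercut the proposed trace analysis.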
- Left unresolved by the planned revision: conclusive demonstration that observed strategic behavior is not attributable to retrieval of pre-trained game-theoretic knowledge, which would require new experiments with payoff matrices absent from training corpora.
Circularity Check
Empirical observational study; no derivation chain or fitted predictions present
full rationale
The paper is a methodological contribution that runs LLMs as subjects in repeated security-dilemma games and reports observed behavioral patterns across three treatment dimensions. No equations, parameter fitting, or first-principles derivations are claimed; the central results are direct experimental outcomes compared against pre-existing game-theoretic and IR predictions. No self-citations are load-bearing for any result, no ansatzes are smuggled, and no quantity is renamed or redefined in terms of itself. The work is therefore self-contained against external benchmarks and exhibits no circularity.