GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Isabel Dahlgren; Pepijn Cobben; Terry Jingchen Zhang; Thao Amelia Pham; Xuanqiang Angelo Huang; Zhijing Jin

arxiv: 2602.12316 · v2 · pith:GKNDCVW2new · submitted 2026-02-12 · 💻 cs.AI · cs.CL· cs.CY· cs.GT· cs.MA

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Pepijn Cobben , Xuanqiang Angelo Huang , Thao Amelia Pham , Isabel Dahlgren , Terry Jingchen Zhang , Zhijing Jin This is my paper

Pith reviewed 2026-05-25 07:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CYcs.GTcs.MA

keywords AI safetygame theorymulti-agent systemsbenchmarkPrisoner's DilemmaAI alignmentrisk assessment

0 comments

The pith

Frontier AI models fail to choose socially beneficial actions in 38% of high-stakes game-theoretic scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GT-HarmBench, a benchmark of 1,535 scenarios that maps AI risk contexts onto classic game structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken. Across 15 frontier models the agents select non-beneficial actions in 38% of cases involving military escalation, election manipulation, and medical malpractice. The evaluation also tracks sensitivity to prompt framing and shows that game-theoretic interventions raise beneficial outcomes by up to 18%. The work fills the gap left by single-agent safety tests by examining coordination and conflict risks that arise when multiple AI systems interact.

Core claim

GT-HarmBench constructs 1,535 high-stakes scenarios by mapping MIT AI Risk Repository entries onto game-theoretic payoff structures. Testing 15 frontier models reveals that agents fail to select socially beneficial actions in 38% of cases. Game-theoretic interventions improve these outcomes by up to 18%, exposing reliability gaps in multi-agent alignment.

What carries the argument

GT-HarmBench, a collection of 1,535 scenarios that translate real AI risk contexts into game-theoretic payoff matrices such as the Prisoner's Dilemma, Stag Hunt, and Chicken.

If this is right

Models display systematic failures in coordination problems that parallel real-world high-stakes decisions.
Changes in prompt framing and ordering measurably alter the rate of socially beneficial choices.
Game-theoretic interventions can reduce the frequency of harmful decisions by up to 18%.
A standardized testbed now exists for studying alignment failures specific to multi-agent settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines could add explicit multi-agent dilemma exercises to reduce the observed failure modes.
Extending the benchmark to repeated or stochastic games would test whether short-term failures persist over longer interactions.
Deploying the scenarios inside simulated environments with actual tool use could check whether benchmark failures predict real deployment risks.

Load-bearing premise

The 1,535 scenarios, built by mapping risk repository entries onto game payoff structures, accurately represent the incentives that would appear in actual high-stakes multi-agent AI deployments.

What would settle it

A controlled test in which the same 15 models, placed in live multi-agent environments that replicate the benchmark payoff structures, consistently select socially beneficial actions at rates above 90% would falsify the reported failure rate.

read the original abstract

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GT-HarmBench turns the MIT risk repo into game-theory scenarios and reports 38% failure plus 18% intervention gains, but the payoff mappings have no reported validation.

read the letter

The paper's core output is GT-HarmBench: 1,535 scenarios built by mapping MIT AI Risk Repository entries onto Prisoner's Dilemma, Stag Hunt, and Chicken payoff structures. It runs 15 frontier models, finds agents choose the socially worse action 38% of the time in high-stakes cases, and shows game-theoretic prompt interventions lift good outcomes by up to 18%. They also release the benchmark and code on GitHub. That release and the focus on multi-agent coordination and conflict are the useful parts; single-agent benchmarks have left this gap open, and a concrete testbed helps close it. The mapping step is the soft spot. The abstract gives no details on how payoffs were assigned, no inter-annotator agreement numbers, and no checks on whether the fixed matrices capture repeated play, asymmetric information, or enforcement that would appear in real deployments like military escalation or election manipulation. If those mappings are loose, the 38% figure becomes hard to interpret. They do mention testing sensitivity to framing and ordering, but without statistical controls or ablation on payoff perturbations the robustness claim stays thin. This is for researchers already working on multi-agent safety who need a starting set of scenarios to run or extend. It is worth sending to peer review because the benchmark release is concrete and the topic matters, even though the construction details will need tightening before the numbers can be taken as firm evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GT-HarmBench, a new benchmark comprising 1,535 high-stakes multi-agent scenarios derived by mapping entries from the MIT AI Risk Repository to classic game-theoretic payoff structures including the Prisoner's Dilemma, Stag Hunt, and Chicken. Evaluation across 15 frontier AI models reveals a 38% rate of failure to select socially beneficial actions in these scenarios, with game-theoretic interventions yielding up to an 18% improvement. The work also examines sensitivity to prompt framing and analyzes failure reasoning patterns, releasing the benchmark and code publicly.

Significance. Should the scenario mappings accurately reflect real-world multi-agent decision incentives, GT-HarmBench would represent a meaningful advance in AI safety evaluation by shifting focus from single-agent to multi-agent risks such as coordination failures and conflict escalation. The empirical results on model failures and the effectiveness of interventions, combined with the public release of the benchmark, could serve as a useful standardized testbed for future alignment research in multi-agent settings.

major comments (2)

[Benchmark construction] The process for mapping MIT AI Risk Repository entries to specific game-theoretic payoff matrices (described in the benchmark construction section) is not described in sufficient detail, including any criteria for assignment, expert review process, inter-annotator agreement, or ablation studies on payoff variations. This mapping is load-bearing for the validity of the reported 38% failure rate and 18% intervention gains, as mismatches in incentive structures could systematically bias the results.
[Results] The paper reports aggregate failure rates and intervention effects (in the results section) without providing breakdowns by game type, risk category, or model, nor details on statistical significance testing or additional controls for prompt sensitivity, making it difficult to evaluate the robustness of the central empirical claims.

minor comments (2)

[Abstract] The abstract claims scenarios span 'game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken' but does not indicate the distribution or counts across these structures.
[Benchmark construction] Consider adding one or two concrete scenario examples in the main text to illustrate how repository entries translate into payoff matrices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Benchmark construction] The process for mapping MIT AI Risk Repository entries to specific game-theoretic payoff matrices (described in the benchmark construction section) is not described in sufficient detail, including any criteria for assignment, expert review process, inter-annotator agreement, or ablation studies on payoff variations. This mapping is load-bearing for the validity of the reported 38% failure rate and 18% intervention gains, as mismatches in incentive structures could systematically bias the results.

Authors: We agree that the mapping process requires expanded documentation to support validity claims. In the revision we will detail explicit assignment criteria (e.g., alignment of risk descriptions with dominant strategies), describe the multi-author review and disagreement resolution process, report inter-annotator agreement, and add ablation studies on payoff variations. These changes will directly mitigate concerns about systematic bias in the reported rates. revision: yes
Referee: [Results] The paper reports aggregate failure rates and intervention effects (in the results section) without providing breakdowns by game type, risk category, or model, nor details on statistical significance testing or additional controls for prompt sensitivity, making it difficult to evaluate the robustness of the central empirical claims.

Authors: We agree that disaggregated results and statistical details are needed for robustness evaluation. The revised manuscript will add tables breaking down performance by game type, risk category, and model, plus statistical significance tests for intervention effects and additional prompt-sensitivity controls. This will strengthen assessment of the central claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with direct measurements

full rationale

The paper constructs a benchmark by mapping entries from the external MIT AI Risk Repository onto standard game-theoretic payoff structures (PD, Stag Hunt, Chicken) and then reports observed failure rates from running 15 models on the resulting 1,535 scenarios. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. The central claims (38% failure rate, up to 18% intervention improvement) are direct empirical measurements, not derived quantities. Self-citations, if present, are not load-bearing for the results. This matches the default expectation for an empirical benchmark release with no self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5734 in / 1079 out tokens · 21676 ms · 2026-05-25T07:29:24.667776+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
nlin.AO 2026-05 unverdicted novelty 7.0

LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI
cs.GT 2026-05 conditional novelty 6.0

Mechanism design leaves a strictly positive welfare loss under incomplete contracts for AI agents, but prosocial agents close this gap and improve social welfare.
Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection
cs.CR 2026-04 unverdicted novelty 5.0

A game-theoretic heterogeneous multi-agent architecture with three cloud LLMs and a local verifier achieves 77.2% F1, 100% recall, and 3x speedup for code vulnerability detection at $0.002 per sample on the NIST Juliet suite.