RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments

Babak Rahmani; Joschka Strüber; Matthias Bethge; Sebastian Dziadzio; Sergio Hernández-Gutiérrez

arxiv: 2606.26094 · v1 · pith:WK6JFG3Knew · submitted 2026-06-24 · 💻 cs.LG


Pith reviewed 2026-06-25 19:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords: policy reconstruction · behavioral experiments · executable code recovery · game environments · opponent modeling · LLM benchmarks · inverse problems


The pith

A learner can reconstruct an agent's hidden decision code from its observed game behavior, and recovers substantially more when allowed to design its own experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether an agent's underlying decision program can be recovered as executable code given only traces of its actions against various opponents. It shows that recovery improves when the learner first designs custom opponent policies to serve as targeted probes before submitting its hypothesis code. The quality of each reconstruction is scored by continuous metrics that compare the actions the recovered code would take to those of the original policy. Recovered code is further tested by entering the reconstructed policy into tournaments, where it produces measurable wins especially for base models that otherwise perform poorly at counter-strategy design.

Core claim

Given only behavioral traces of an agent in a game environment, a learner can reconstruct the underlying decision program as executable code. Recovery improves when the learner designs controlled experiments in the form of custom opponent policies that elicit informative behavior. The recovered code carries informative signal that yields competitive advantage in downstream player-versus-player tournaments.

What carries the argument

RevengeBench, a benchmark of 75 LLM-generated Elo-calibrated policies across five game environments, scored by continuous action-distance metrics and downstream tournament performance.

If this is right

Reconstructed policies produce measurable competitive advantage when entered into player-versus-player tournaments.
Recovery quality varies substantially across frontier LLMs, closing between 34 and 72 percent of the initial distance to the target policy.
Weaker base models gain the largest tournament benefit from recovered code that enables effective counter-strategies.
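The "percent of the initial distance closed" figure can be read as a normalized reduction in action distance. The sketch below assumes the baseline is the distance of the learner's starting hypothesis to the target; the function name `fraction_closed` and the example distances are hypothetical and are not taken from the paper.

```python
def fraction_closed(initial_distance: float, final_distance: float) -> float:
    """Share of the starting action distance removed by the final hypothesis.

    Assumption: both distances are measured against the same target policy
    on the same evaluation states.
    """
    if initial_distance <= 0:
        raise ValueError("initial distance must be positive")
    return (initial_distance - final_distance) / initial_distance


# Hypothetical distances, chosen only to land on the two ends of the reported range.
weakest = fraction_closed(initial_distance=0.50, final_distance=0.33)
best = fraction_closed(initial_distance=0.50, final_distance=0.14)
print(round(weakest, 2), round(best, 2))
```

Under this reading, a value of 1.0 means the hypothesis is behaviorally indistinguishable from the target on the evaluation states, and a negative value means the final hypothesis is worse than the starting one.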

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same active-probe approach could be applied to recover decision logic from black-box systems outside games.
It opens a route to automated policy debugging by iteratively refining executable hypotheses against observed behavior.
If the method generalizes, it supplies a concrete test for claims about inferring latent mechanisms from limited observations.

Load-bearing premise

The 75 LLM-generated policies form a representative testbed in which action-distance metrics validly capture reconstruction quality and tournament advantage.

What would settle it

An experiment in which permitting the learner to design behavioral probes yields no measurable increase in reconstruction accuracy or downstream tournament wins relative to passive observation alone.
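One way such a comparison could be analyzed is a paired test over targets, where each target contributes the difference between its active-probing score and its passive-only score. The sketch below uses an exact sign-flip permutation test; the test choice, the function name `sign_flip_p_value`, and all numbers are illustrative assumptions, not the paper's analysis or data.

```python
from itertools import product
from typing import Sequence


def sign_flip_p_value(diffs: Sequence[float]) -> float:
    """Exact one-sided paired permutation test.

    Under the null hypothesis that probing makes no difference, each paired
    difference is equally likely to be positive or negative. The p-value is
    the share of sign assignments whose total is at least the observed total.
    """
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
    return at_least_as_large / total


# Hypothetical per-target gains (active minus passive); not from the paper.
consistent_gain = [0.10, 0.05, 0.12, 0.02, 0.08, 0.07, 0.03, 0.09]
no_clear_gain = [0.04, -0.04, 0.02, -0.02, 0.01, -0.01, 0.03, -0.03]

print(sign_flip_p_value(consistent_gain))
print(sign_flip_p_value(no_clear_gain))
```

A large p-value on real paired scores would be the "no measurable increase" outcome described above. The enumeration visits 2^n sign assignments, so it is only practical for small numbers of targets; a sampled version would be needed at the scale of 75 targets.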

Figures

Figures reproduced from arXiv: 2606.26094 by Babak Rahmani, Joschka Strüber, Matthias Bethge, Sebastian Dziadzio, Sergio Hernández-Gutiérrez.

**Figure 1.** Benchmark overview. Left: the learner alternates between passively observing the hidden policy play against sampled opponents and actively probing it with self-authored opponents. Right: fraction of initial action distance closed (↑) for twelve frontier LLMs using mini-SWE-agent. Inverse problems in agent modeling often assume one of two access regimes: a fixed corpus of demonstrations to imitate or invert…

**Figure 2.** Iterative reasoning in BattleSnake. Left: the reasoning trace of GPT-5 within a single round, all quotes verbatim. Right: GPT-5.4-mini’s per-round action distance and submitted strategy summaries: a good initial hypothesis is followed by a regression and a self-correction by round 5. … substantially: the best model closes 72% of initial behavioral distance, the weakest only 34%. Active probing helps some mod…

**Figure 3.** Strategy recovery performance. Top left: cumulative distribution of distance reduction across all 75 targets. Top right: cost–quality trade-off, mean API cost per run vs. distance reduction (%). Bottom: per-model, per-game mean distance reduction (%). Gemma-4 31B achieves 51% recovery at $0.06 per run. No meaningful correlation emerges between target playing strength and reverse-engineering difficulty (Spe…

**Figure 4.** PvP tournament: a challenger agent writes a counter-strategy against a fixed target over 5 rounds, under three levels of opponent information (blind, recovered, oracle). Results are averaged over 20 targets (5 per game), 5 rounds, and 3 seeds per model. Left: win rate per challenger model, averaged across games and rounds. Horizontal ticks show the cross-game mean. The ordering oracle > recovered > blind h…

**Figure 5.** Step-level action distribution across rounds for twelve models over all 75 targets. Cumulative height reflects the number of trajectories still active at each step. … plurality of turns on Write with very little Execute, suggesting it attempts large edits rather than incremental test-fix loops. Why does probing help selectively? A probe is an executable experiment the learner must design, implement, and depl…

**Figure 6.** Per-game normalised recovery score for all models. Performance rankings vary across …

**Figure 7.** Mean number of probes executed per tournament, grouped by game. Each bar is segmented …

**Figure 8.** Mean inline probe failures per tournament. Each bar is segmented by round (darker = …

**Figure 9.** Probe boost vs. average probes used per tournament. Each point is one (model …

**Figure 10.** Fraction of targets where the challenger achieves at least a given win rate, by condition.

**Figure 11.** Oracle lift (oracle minus blind win rate) across rounds, with …

**Figure 12.** Per-model PvP win-rate trajectories across games and intel conditions (blind, recovered, …

**Figure 13.** Context-length distribution and cumulative input cost across two independent GPT-5 runs …

**Figure 14.** History compaction ablation (GPT-5, 3 targets …

**Figure 15.** Effect of history compaction on context length (same ablation as Figure …

**Figure 16.** Within-Gemma variance components at each round, grouped side-by-side (not stacked, …

**Figure 17.** Within-Gemma cross-run Spearman ρ between per-target normalised-recovery rankings, one panel per arena. Mean ρ in each panel title. BattleSnake and Poker show uniformly high agreement; Halite, RoboCode, and RobotRumble show large run-to-run rank flips. Plot labels (models): DeepSeek v4 Flash, DeepSeek v4 Pro, GLM-5.1, GPT-5, GPT-5.4-mini, GPT-5.5, GPT-oss-120b, Gemma-4 26B-A4B, Gemma-4 31B, Grok-4.1-fast, Kimi K2.6, Qwen3.6 35B-A3B.

**Figure 18.** Distribution of σ_sim at round 5, one boxplot per evaluator, faceted by arena. The within-arena overlap across the 12 evaluators supports treating σ_sim as an arena-level property rather than a model-specific quantity.

**Figure 19.** Per-target difficulty agreement: Gemma 4 31B’s 5-run mean recovery score (x-axis) versus …

**Figure 20.** Total API cost breakdown per model across all 75 runs. Stacked segments show prompt, …

**Figure 21.** Total token usage per model (millions). Stacked segments show prompt, reasoning, and …

Original abstract

For most of scientific history, researchers studying behavior could only infer hidden mechanisms from outward actions: an inverse problem that becomes more tractable when observation is augmented by targeted intervention. We pose a computational analogue: given only behavioral traces of an agent in a game environment, can a learner reconstruct the underlying decision program as executable code, and how much does this reconstruction improve with the ability to design controlled experiments? We introduce RevengeBench, a benchmark of 75 LLM-generated, Elo-calibrated policies across five game environments, drawn from CodeClash tournament trajectories. The learner observes the hidden target policy play against sampled opponents and designs behavioral probes in the form of custom opponent policies that elicit informative behavior. It then submits an executable hypothesis, which is evaluated using continuous action-distance metrics. We further validate that recovered code carries informative signal in downstream player-versus-player tournaments. Across twelve frontier LLMs, recovery quality varies substantially (34 to 72% of initial distance closed), with reconstructed policies yielding measurable competitive advantage, particularly for weaker models that otherwise struggle to design effective counter-strategies. Our benchmark positions behavioral recovery of programmatic policies as a tractable inverse problem in code-space, opening a path to opponent modeling, policy interpretability, and the broader question of inferring latent mechanisms from observations.
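The observe, probe, and submit loop described in the abstract can be illustrated with a toy game. This is a minimal sketch under strong simplifying assumptions: the hidden policy reacts only to the opponent's previous move in rock-paper-scissors, the hypothesis is a lookup table, and the distance is a plain mismatch rate rather than the paper's continuous action-distance metrics. All names (`hidden_target`, `collect_traces`, `fit_hypothesis`, `action_distance`) are hypothetical.

```python
from typing import Callable, List, Tuple

MOVES = ["rock", "paper", "scissors"]
Policy = Callable[[str], str]  # maps the opponent's previous move to an action


def hidden_target(opp_last: str) -> str:
    """Stand-in for the hidden policy: play whatever beats the opponent's last move."""
    return {"rock": "paper", "paper": "scissors", "scissors": "rock"}[opp_last]


def collect_traces(target: Policy, opponent_moves: List[str]) -> List[Tuple[str, str]]:
    """Record (observation, action) pairs as the target responds to one opponent."""
    return [(move, target(move)) for move in opponent_moves]


def fit_hypothesis(traces: List[Tuple[str, str]], default: str = "rock") -> Policy:
    """Build an executable hypothesis: a lookup table with a default for unseen states."""
    table = dict(traces)
    return lambda opp_last: table.get(opp_last, default)


def action_distance(a: Policy, b: Policy, states: List[str]) -> float:
    """Mismatch rate between two policies over a set of states (toy metric)."""
    return sum(a(s) != b(s) for s in states) / len(states)


# Passive observation: the sampled opponent happens to play only "rock".
passive = collect_traces(hidden_target, ["rock", "rock", "rock"])
# Active probing: the learner authors an opponent that cycles through every move.
probe = collect_traces(hidden_target, MOVES)

d_passive = action_distance(fit_hypothesis(passive), hidden_target, MOVES)
d_active = action_distance(fit_hypothesis(passive + probe), hidden_target, MOVES)
print(d_passive, d_active)
```

In this toy setting the probe helps only because the sampled opponent never visits two of the three states; the benefit of a probe depends on how much of the state space passive play leaves unexplored.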

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RevengeBench sets up code recovery from behavior but its metrics likely only track behavioral similarity rather than exact program match.


The key takeaway is that RevengeBench introduces a setup for recovering executable policy code from behavioral traces by letting the learner design its own experiments, but the evaluation may not distinguish true code recovery from behavioral copying.

The new part is the full pipeline: 75 Elo-calibrated LLM policies from CodeClash, custom opponent probes, executable hypothesis submission, action-distance scoring, and PvP tournament validation. This creates a controlled test of the inverse problem in code space. The results across twelve LLMs show varying success in closing distance and gaining competitive edge, which is a solid empirical start for this kind of benchmark.

It does well in framing the problem with intervention, not just passive observation, and in checking downstream utility through tournaments. That gives the work a practical angle beyond pure metrics.

The soft spot is the one the stress-test flags. The paper scores hypotheses with continuous action-distance metrics against observed behavior. This can be reduced by any policy that matches the actions in the probed situations, even if the internal decision logic differs from the target. The abstract gives no indication of additional checks like code similarity, execution traces on unseen states, or manual inspection to confirm structural match. So the 34-72% figures and tournament gains could reflect successful approximation rather than recovery of the specific program. The assumption that the LLM-generated testbed is representative and that distance metrics validly measure code recovery is load-bearing and not obviously supported here.
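The concern can be made concrete with a toy case in which a hypothesis with different internal logic scores a perfect distance on the probed states yet diverges elsewhere. This is an illustrative sketch only: the threshold policy, the memorizing hypothesis, the state sets, and the mismatch-rate distance are all hypothetical and are not the paper's environments or metrics.

```python
from typing import Callable, Iterable


def target(x: int) -> str:
    """Hypothetical target logic: a threshold rule on a single integer feature."""
    return "attack" if x >= 5 else "defend"


def memorizing_hypothesis(x: int) -> str:
    """Different internal logic: replay the probed states, otherwise defend."""
    seen = {0: "defend", 1: "defend", 8: "attack", 9: "attack"}
    return seen.get(x, "defend")


def mismatch_rate(a: Callable[[int], str], b: Callable[[int], str],
                  states: Iterable[int]) -> float:
    """Share of states on which the two policies choose different actions."""
    states = list(states)
    return sum(a(s) != b(s) for s in states) / len(states)


probed_states = [0, 1, 8, 9]   # states the probes happened to visit
all_states = range(10)         # includes states 5, 6, 7 that were never probed

on_probed = mismatch_rate(memorizing_hypothesis, target, probed_states)
on_all = mismatch_rate(memorizing_hypothesis, target, all_states)
print(on_probed, on_all)
```

A distance of zero on the probed states says nothing about states 5 to 7, which is why a held-out evaluation set or a structural comparison would be needed to separate program recovery from behavioral copying.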

This paper is for people working on agent interpretability, reverse engineering of policies, or benchmarks for LLM code generation in games. A reader focused on those areas would get value from the protocol and the reported numbers. It deserves serious referee time because the idea is distinct from prior work and the experimental design is concrete, even if the metrics need more justification.

I would recommend sending it for peer review.

Referee Report

2 major / 0 minor

Summary. The paper introduces RevengeBench, a benchmark of 75 LLM-generated, Elo-calibrated policies across five game environments drawn from CodeClash trajectories. It poses the inverse problem of recovering the underlying decision program as executable code from behavioral traces alone, and measures how much recovery improves when the learner can design custom opponent policies as behavioral probes. Recovery quality is assessed via continuous action-distance metrics on observed behavior, with further validation that the recovered code yields competitive advantage in downstream player-versus-player tournaments. Results across twelve frontier LLMs show 34–72% of initial distance closed, with larger gains for weaker models.

Significance. If the central claim holds, the work establishes a tractable benchmark for programmatic policy recovery in code-space, with direct relevance to opponent modeling and interpretability. The scale (75 policies), Elo calibration, and tournament-based validation are concrete strengths that provide falsifiable downstream evidence beyond the primary metrics. The framing as an inverse problem augmented by experimental design is a clear conceptual contribution.

major comments (2)

[Abstract] The central claim is recovery of the 'underlying decision program as executable code,' yet evaluation uses only 'continuous action-distance metrics' on observed behavior against sampled opponents. Because distinct programs can produce equivalent or near-equivalent action distributions (especially in finite state spaces with LLM-generated opponents), a reduction in action distance does not entail that the submitted hypothesis matches the target program's structure or logic. This assumption is load-bearing for positioning the benchmark as solving an inverse problem in code-space rather than behavioral approximation.
[Abstract] The downstream tournament validation is presented as confirming that 'recovered code carries informative signal,' but without details on how the recovered executable is executed or compared to the original program (e.g., structural similarity, exact code match, or ablation on whether behavioral mimicry alone suffices for the observed gains), it remains unclear whether the tournament results isolate program recovery from successful behavioral cloning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important distinctions between behavioral approximation and structural code recovery, which we address point by point below with proposed revisions where appropriate.


Referee: [Abstract] The central claim is recovery of the 'underlying decision program as executable code,' yet evaluation uses only 'continuous action-distance metrics' on observed behavior against sampled opponents. Because distinct programs can produce equivalent or near-equivalent action distributions (especially in finite state spaces with LLM-generated opponents), a reduction in action distance does not entail that the submitted hypothesis matches the target program's structure or logic. This assumption is load-bearing for positioning the benchmark as solving an inverse problem in code-space rather than behavioral approximation.

Authors: We agree that a reduction in action-distance does not guarantee that the recovered code matches the target's internal structure or logic, since multiple programs can produce similar behaviors. The benchmark evaluates the submission of executable code that achieves functional recovery as measured by behavioral metrics, rather than syntactic or structural identity. We will revise the abstract to clarify that the inverse problem is addressed through functional equivalence in code form, as quantified by the action-distance metrics, rather than exact recovery of the decision logic. This change will better align the positioning with the evaluation approach.

Revision: yes

Referee: [Abstract] The downstream tournament validation is presented as confirming that 'recovered code carries informative signal,' but without details on how the recovered executable is executed or compared to the original program (e.g., structural similarity, exact code match, or ablation on whether behavioral mimicry alone suffices for the observed gains), it remains unclear whether the tournament results isolate program recovery from successful behavioral cloning.

Authors: The recovered code is executed directly as the agent's policy within the game environments for the player-versus-player tournaments, with performance measured via win rates and competitive advantage relative to the original target and baselines. We will add explicit details on this execution process in the methods and results sections. We will also include a discussion noting that the observed gains may partly derive from behavioral approximation and that the benchmark does not include ablations fully isolating structural recovery from cloning. The code output format nonetheless provides additional utility for interpretability and further experimentation beyond pure behavioral cloning.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluation metrics


The paper presents RevengeBench as an empirical benchmark for recovering executable policies from behavioral traces, using action-distance metrics on observed play and downstream tournament performance as direct measures. No equations, fitted parameters, or derivations are described that reduce the claimed code-space reconstruction to the input traces or metrics by construction. The testbed is drawn from external CodeClash trajectories, and the evaluation chain (probe design, hypothesis submission, distance closure, tournament validation) does not rely on self-definitional loops, self-citation load-bearing premises, or renaming of known results. The central claim remains a falsifiable empirical question about LLM performance on the benchmark rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no free parameters, invented entities, or detailed axioms are stated. The central premise rests on the domain assumption that game environments permit informative behavioral probes via custom opponents.

axioms (1)

Domain assumption: Behavioral traces from game play against sampled opponents can be augmented by learner-designed custom opponents to produce sufficiently informative data for executable code recovery.
This premise underpins the entire experimental loop described in the abstract.




[1] [1]

Ng and Stuart Russell , editor =

Andrew Y. Ng and Stuart Russell , editor =. Algorithms for Inverse Reinforcement Learning , booktitle =

[2] [2]

Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025), Montreal, Canada, August 16-22, 2025 , pages=

Combining code generating large language models and self-play to iteratively refine strategies in games , author=. Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025), Montreal, Canada, August 16-22, 2025 , pages=

2025

[3] Action understanding as inverse planning. Cognition, 2009.

[4] Naik, Atharva and Mathur, Yash and Agrawal, Darsh and Kapadnis, Manav and An, Yuwei and Marr, Clayton and Rose, Carolyn and Mortensen, David and others.

[5] Jiayi Geng and Howard Chen and Dilip Arumugam and Thomas L. Griffiths. Are Large Language Models Reliable. 2025.

[6] Abbeel, Pieter and Ng, Andrew Y. Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

[7] Torabi, Faraz and Warnell, Garrett and Stone, Peter. Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018.

[8] Yang, John and Lieret, Kilian and Yang, Joyce and Jimenez, Carlos E and Press, Ofir and Schmidt, Ludwig and Yang, Diyi.

[9] Gu, Yuling and Tafjord, Oyvind and Kim, Hyunwoo and Moore, Jared and Bras, Ronan Le and Clark, Peter and Choi, Yejin. Simpletom: Exposing the gap between explicit tom inference and implicit tom application in

[10] Testing theory of mind in large language models and humans. Nature Human Behaviour, 2024.

[11] Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 2024.

[12] 2026.

[13] Hennes, Daniel and Li, Zun and Schultz, John and Lanctot, Marc. Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems, 2026.

[14] Modeling Others' Minds as Code. The Fourteenth International Conference on Learning Representations.

[15] Cross, Logan and Xiang, Violet and Bhatia, Agam and Yamins, Daniel and Haber, Nick. Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models.

[16] Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models. Second Conference on Language Modeling.

[17] Zhining Zhang and Chuanyang Jin and Mung Yao Jia and Shunchi Zhang and Tianmin Shu. 2026.

[18] Machine Theory of Mind. Proceedings of the 35th International Conference on Machine Learning, 2018.

[19] Gandhi, Kanishk and Fraenken, Jan-Philipp and Gerstenberg, Tobias and Goodman, Noah. Understanding Social Reasoning in Language Models with Language Models.

[20] Theory of mind for multi-agent collaboration via large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[21] Position: Theory of Mind Benchmarks are Broken for Large Language Models. Forty-second International Conference on Machine Learning Position Paper Track.

[22] Nashed, Samer and Zilberstein, Shlomo. J. Artif. Int. Res., May 2022.

[23] Foerster, Jakob and Chen, Richard Y. and Al-Shedivat, Maruan and Whiteson, Shimon and Abbeel, Pieter and Mordatch, Igor. Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018.

[24] Model-free opponent shaping. International Conference on Machine Learning, 2022.

[25] Khan, Akbir and Willi, Timon and Kwan, Newton and Tacchetti, Andrea and Lu, Chris and Grefenstette, Edward and Rockt\". Scaling Opponent Shaping to High Dimensional Games. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems.

[26] Advantage Alignment Algorithms. The Thirteenth International Conference on Learning Representations.

[27] Opponent Modeling with In-context Search. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[28] Yu, XiaoPeng and Zhang, Wanpeng and Lu, Zongqing.

[29] Scaling Inference-Time Computation via Opponent Simulation: Enabling Online Strategic Adaptation in Repeated Negotiation. Workshop on Multi-Agent Learning and Its Opportunities in the Era of Generative AI.

[30] Playing repeated games with large language models. Nature Human Behaviour, 2025.

[31] Jinhao Duan and Renming Zhang and James Diffenderfer and Bhavya Kailkhura and Lichao Sun and Elias Stengel-Eskin and Mohit Bansal and Tianlong Chen and Kaidi Xu. 2024.

[32] Anthony Costarelli and Mat Allen and Roman Hauksson and Grace Sodunke and Suhas Hariharan and Carlson Cheng and Wenjie Li and Joshua M Clymer and Arjun Yadav. 2024.

[33] Giorgio Piatti and Zhijing Jin and Max Kleiman-Weiner and Bernhard Sch. Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[34] Froger, Romain and Andrews, Pierre and Bettini, Matteo and Budhiraja, Amar and Cabral, Ricardo Silveira and Do, Virginie and Garreau, Emilien and Gaya, Jean-Baptiste and Lauren. arXiv preprint arXiv:2509.17158.

[35] Cheng, Haowei and Kim, Milhan and Khomh, Foutse and Racharak, Teeradaj and Yoshioka, Nobukazu and Ubayashi, Naoyasu and Washizaki, Hironori.

[36] Mathematical discoveries from program search with large language models. Nature, 2024.

[37] Yamada, Yutaro and Lange, Robert Tjarko and Lu, Cong and Hu, Shengran and Lu, Chris and Foerster, Jakob and Clune, Jeff and Ha, David. The

[38] Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction. First Workshop on Foundations of Reasoning in Language Models.

[39] Anjiang Wei and Tarun Suresh and Jiannan Cao and Naveen Kannan and Yuheng Wu and Kai Yan and Thiago S. F. X. Teixeira and Ke Wang and Alex Aiken. Code. 2025.

[40] Deepro Choudhury and Sinead Williamson and Adam Golinski and Ning Miao and Freddie Bickford Smith and Michael Kirchhof and Yizhe Zhang and Tom Rainforth. 2026.

[41] Probing the multi-turn planning capabilities of LLMs via 20 question games. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[42] Tang, Hao and Key, Darren and Ellis, Kevin. doi:10.52202/079017-2243.

[43] Parshin Shojaee and Kazem Meidani and Shashank Gupta and Amir Barati Farimani and Chandan K. Reddy. 2025.

[44] Representing and intervening: Introductory topics in the philosophy of natural science. 1983.

[45] John Yang and Carlos E Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik R Narasimhan and Ofir Press. 2024.

[46] A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. Second Conference on Language Modeling.

[47] Wang, Sida I. Measuring all the noises of

[48] Battlesnake challenge: A multi-agent reinforcement learning playground with human-in-the-loop. arXiv preprint arXiv:2007.10504.

[49] Michael Truell and Benjamin Spector. 2016.

[50] Bhavesh Kumar and Hoang Nguyen and Roger Jin. 2025.

[51] Robocode: using games to teach artificial intelligence. Journal of Computing Sciences in Colleges, 2004.

[52] Outkine, Anton and Oxer, Noa. 2020.

[53] Singh, Aaditya and Fry, Adam and Perelman, Adam and Tart, Adam and Ganesh, Adi and El-Kishky, Ahmed and McLaughlin, Aidan and Low, Aiden and Ostrow, AJ and Ananthram, Akhila and others.

[54] Kimi. arXiv preprint arXiv:2507.20534.

[55] Yang, John and Lieret, Kilian and Ma, Jeffrey and Thakkar, Parth and Pedchenko, Dmitrii and Sootla, Sten and McMilin, Emily and Yin, Pengcheng and Hou, Rui and Synnaeve, Gabriel and others.