pith. machine review for the scientific record.

arxiv: 2604.07733 · v1 · submitted 2026-04-09 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · strategic decision making · Civilization V · benchmark evaluation · victory probability · multi-agent games · long-horizon strategy

The pith

CivBench evaluates LLM strategic decisions in Civilization V using turn-by-turn victory probability estimates instead of final win or loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CivBench as a way to test how well language model agents handle strategy in long, competitive games like Civilization V. Because full games span hundreds of turns and end with only one winner, final results reveal little about what happened along the way. The benchmark therefore trains a predictor that estimates how likely a player is to win from any given game state at a particular turn. This yields a continuous measure of progress that can show how different models and agent setups perform over time. Tests with seven models across 307 games indicate that this reveals distinct strategic behaviors and configuration effects that final scores alone cannot show.

Core claim

By training models to estimate victory probabilities from turn-level game states in multiplayer Civilization V, CivBench enables dense evaluation of strategic decision-making in LLM agents; validity checks confirm that the estimates track actual outcomes and strategic quality, and analysis of 307 games identifies distinct profiles across models and agent conditions.

What carries the argument

A victory probability model trained on game state features at each turn to produce ongoing estimates of winning chances.
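
A minimal sketch of what such an estimator could look like, assuming the per-turn game state is flattened into a fixed-length feature vector and a stock classifier stands in for the paper's actual model (the figures mention an AttentionMLP, among others). The variable names and synthetic data below are illustrative, not the paper's pipeline.

    # Sketch: per-turn victory-probability estimator over (game, turn, player) rows.
    # Synthetic placeholder data; real features would encode science, production,
    # military strength, cities, diplomacy, and so on for each player each turn.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(0)
    n_rows, n_features = 5000, 12
    X = rng.normal(size=(n_rows, n_features))      # placeholder game-state features
    game_ids = rng.integers(0, 300, size=n_rows)   # which game each row came from
    y = rng.integers(0, 2, size=n_rows)            # 1 if this player eventually won

    # Split by game so no single game leaks rows into both train and test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=game_ids))

    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])

    # The dense evaluation signal: estimated P(eventual win) at every held-out turn.
    p_win = model.predict_proba(X[test_idx])[:, 1]

Trained this way, the estimator can be queried at every turn of a new game, turning a single terminal win or loss into a full trajectory of winning chances for each player.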

If this is right

  • Strategic capabilities of LLMs can be tracked and compared throughout extended play rather than only at the end.
  • Effects of different agent setups on decision-making become measurable for each model.
  • Strategic profiles invisible to outcome-only metrics can be outlined.
  • Unsaturated benchmarks for long-horizon multi-agent strategy are possible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Agents could receive real-time feedback during play to improve their strategies iteratively.
  • Similar progress-based signals might apply to evaluating strategy in other complex games with sparse rewards.
  • Comparison between LLM profiles and human player data could highlight where models excel or fall short.

Load-bearing premise

Victory probability estimates from game states serve as a valid and rich enough signal for measuring the quality of strategic decisions.

What would settle it

A direct test showing that the estimated probabilities do not reliably predict who actually wins the games or do not align with independent measures of strategic strength would undermine the approach.
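
One concrete form such a test could take, assuming held-out per-turn estimates and eventual outcomes are available as flat arrays; the Brier-score and log-loss checks mirror what the figures report, but the exact protocol here is an illustrative sketch rather than the paper's validation code.

    # Predictive-validity check: do the turn-level estimates track who actually wins?
    import numpy as np
    from sklearn.metrics import brier_score_loss, log_loss

    def predictive_validity_report(p_win, won, turn_progress):
        """p_win: estimated P(eventual win) per (game, turn, player) row;
        won: 1 if that player eventually won; turn_progress: fraction of game elapsed."""
        p_win = np.asarray(p_win, dtype=float)
        won = np.asarray(won, dtype=int)
        turn_progress = np.asarray(turn_progress, dtype=float)
        print(f"overall Brier: {brier_score_loss(won, p_win):.3f}")
        print(f"overall log loss: {log_loss(won, p_win):.3f}")
        # Estimates should sharpen as the game progresses; flat or worsening
        # late-game scores would undercut the progress-based signal.
        deciles = np.clip((turn_progress * 10).astype(int), 0, 9)
        for d in range(10):
            mask = deciles == d
            if mask.any():
                print(f"decile {d}: Brier {brier_score_loss(won[mask], p_win[mask]):.3f}")

If late-game estimates failed to separate eventual winners from losers, or the deciles showed no improvement over a constant baseline, the load-bearing premise above would not hold.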

Figures

Figures reproduced from arXiv: 2604.07733 by Can Gurkan, John Chen, Mingyi Lin, Sihan Cheng.

Figure 1: Late-game log loss by victory type and model.
Figure 2: Overall (first column) and per-strategy (rest columns) ELO ratings.
Figure 3: Most chosen strategy, most pivoted-to strategy, and best-ELO strategy.
Figure 4: Brier score and log loss by turn-progress decile for all eight models.
Figure 5: Within-game rank agreement by turn progress. Agreement is high overall and …
Figure 6: Group permutation importance for AttentionMLP by feature group and eventual …
Figure 7: Empirical head-to-head outperform probabilities.
Figure 8: Strategic commitment relative to Vanilla (positive = more concentrated).
Figure 9: Strategy time allocation (mean share of game time) by player type.
Figure 10: Win probability at strategy pivot.
Figure 11: Pivot-flow heatmaps by player type. Each facet shows from …
Figure 12: Ablation: Sonnet-4.5-Simple.
Figure 13: Ablation: Qwen-3.5-Simple.
Figure 14: Ablation: Kimi-K2.5-Simple.
Figure 15: Distribution of focus briefer count per game for briefed configurations.

Appendix excerpts captured with Figures 14 and 15:
A.3.7 LLM Inference Costs. Cost estimation is derived from OpenRouter as of Mar 26, 2026 and does not include input cache or batch inference discounts. Total = $10,497.
A.4 Logged Signal Definitions. CivBench logs two classes of signals: game-state signals recorded each turn from the observable game state, and agentic decision signals recorded when the LLM makes a decision. A.4.1 Game-state Signals, PlayerSummaries: per-player per-turn snapshot of competitive standing and resource state.
Original abstract

Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CivBench, a benchmark for LLM-based agents in multiplayer Civilization V. It replaces sparse terminal win/loss signals with turn-level victory probability estimates trained on game-state features, validated via predictive, construct, and convergent validity checks. Experiments across 307 games with 7 LLMs under multiple agent conditions are used to argue that the benchmark is unsaturated, reveals model-specific effects of agentic setups, and identifies distinct strategic profiles invisible to outcome-only metrics.

Significance. If the validity of the progress-based proxy holds, the work provides a denser evaluation signal for long-horizon strategic capabilities in generative multi-agent settings, addressing a recognized gap in current LLM agent benchmarks. The scale (307 games) and use of three distinct validity types are strengths that could support more granular analysis of planning and opponent modeling than win rates alone.

major comments (2)
  1. [§3.2] Victory Probability Estimator: The central claim that these estimates measure LLM-specific strategic decision-making rests on the assumption that the model captures more than aggregate progress metrics (e.g., unit counts, city development). No ablation on decision features or counterfactual state edits is described to isolate the contribution of LLM actions from general game-state progress, which bears directly on whether model-specific effects and distinct profiles can be attributed to strategy rather than shared progress signals.
  2. [§4.3] Strategic Profiles: The reported distinct profiles and model-specific agent effects are derived from the probability trajectories, yet the manuscript provides no statistical comparison (e.g., significance tests or variance decomposition) against outcome-only baselines or non-LLM agents to confirm these profiles are not artifacts of the estimator's sensitivity to state progress.
minor comments (2)
  1. [Abstract] The abstract states results from 'multiple CivBench agent conditions' without enumerating them; a brief list would improve readability.
  2. [§4] Figure legends in §4 could more explicitly label the different LLMs and agent variants to aid interpretation of the probability curves.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas for strengthening the validation of our progress-based evaluation. We respond to each major comment below and describe the revisions we will make.

Point-by-point responses
  1. Referee: [§3.2] Victory Probability Estimator: The central claim that these estimates measure LLM-specific strategic decision-making rests on the assumption that the model captures more than aggregate progress metrics (e.g., unit counts, city development). No ablation on decision features or counterfactual state edits is described to isolate the contribution of LLM actions from general game-state progress, which bears directly on whether model-specific effects and distinct profiles can be attributed to strategy rather than shared progress signals.

    Authors: We appreciate the referee highlighting this assumption. Our validation already includes predictive accuracy on held-out games, construct validity through alignment with known strategic indicators, and convergent validity with outcome metrics. Nevertheless, we agree that targeted ablations and counterfactual analyses would more directly isolate LLM decision contributions. In the revised manuscript we will add an ablation study training estimator variants with and without features tied to LLM actions (e.g., unit movement sequences and city-build decisions) and quantify performance degradation. We will also report results from feasible counterfactual state edits, generated by replaying games with perturbed LLM-derived actions while holding other state elements fixed, to demonstrate sensitivity to strategic choices beyond aggregate progress. revision: yes

  2. Referee: [§4.3] Strategic Profiles: The reported distinct profiles and model-specific agent effects are derived from the probability trajectories, yet the manuscript provides no statistical comparison (e.g., significance tests or variance decomposition) against outcome-only baselines or non-LLM agents to confirm these profiles are not artifacts of the estimator's sensitivity to state progress.

    Authors: We concur that statistical confirmation is needed to rule out artifacts. The observed profiles already differ visibly from win-rate outcomes across the 307 games, but we will strengthen this in revision. We will add formal statistical comparisons, including permutation tests and ANOVA on trajectory features (e.g., slope, variance, and inflection points) against both outcome-only baselines and non-LLM agents (rule-based and random). We will also include variance decomposition to partition the explained variance attributable to model-specific strategy versus general state progress, with results reported in the updated §4.3. revision: yes
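
As a sense of what the permutation test promised in the second response could look like: a minimal sketch assuming each game yields one trajectory-level summary per player-model pair (here, the slope of estimated win probability over turns). The revised paper's choice of features, groupings, and corrections may differ; all inputs below are illustrative.

    # Permutation test: is the gap in mean trajectory slope between two models
    # larger than random relabeling of games would produce?
    import numpy as np

    def trajectory_slope_permutation_test(slopes_a, slopes_b, n_perm=10_000, seed=0):
        """slopes_a, slopes_b: per-game win-probability slopes for two models (illustrative inputs)."""
        rng = np.random.default_rng(seed)
        slopes_a = np.asarray(slopes_a, dtype=float)
        slopes_b = np.asarray(slopes_b, dtype=float)
        observed = slopes_a.mean() - slopes_b.mean()
        pooled = np.concatenate([slopes_a, slopes_b])
        n_a = len(slopes_a)
        exceed = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)                      # randomly reassign model labels
            diff = pooled[:n_a].mean() - pooled[n_a:].mean()
            exceed += abs(diff) >= abs(observed)
        p_value = (exceed + 1) / (n_perm + 1)        # two-sided, add-one smoothing
        return observed, p_value

The same machinery extends to other trajectory features (variance, inflection points) and to comparisons against rule-based or random baselines.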

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains a victory probability estimator on turn-level game states and applies it to evaluate LLM agents across 307 games, with validity established via separate predictive, construct, and convergent checks. No equations, self-citations, or definitional steps are shown that reduce the central claims (model-specific effects and distinct profiles) back to the training inputs by construction. The estimator serves as an independent proxy rather than a tautological renaming or fitted prediction of the LLM behaviors themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone; the approach implicitly assumes the validity of probability estimation from game states.

pith-pipeline@v0.9.0 · 5440 in / 1067 out tokens · 37166 ms · 2026-05-10T17:55:27.394970+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 35 canonical work pages · 3 internal anchors


  2. [2]

    https://www.reddit.com/r/civ5/comments/fklkzw/the_way_scores_are_calculated_bothers_me/, 2020

    The way scores are calculated bothers me . https://www.reddit.com/r/civ5/comments/fklkzw/the_way_scores_are_calculated_bothers_me/, 2020. Community forum source. Accessed 2026-03-28

  3. [3]

    https://forums.civfanatics.com/threads/650-ai-game-4uc-stats-analysis.689648/page-2, 2024

    650 AI Game 4UC Stats / Analysis . https://forums.civfanatics.com/threads/650-ai-game-4uc-stats-analysis.689648/page-2, 2024. Community forum source. Accessed 2026-03-28

  4. [4]

    Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo...

  5. [5]

    Optuna: A next-generation hyperparameter optimization framework

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pp.\ 2623--2631, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi...

  6. [6]

    When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

    Norah Alzahrani, Hisham Alyahya, Yazeed Alnumay, Sultan AlRashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M Saiful Bari, and Haidar Khan. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Procee...

  7. [7]

    Harnessing language for coordination: A framework and benchmark for llm-driven multiagent control

    Timothée Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, and Sebastian Risi. Harnessing language for coordination: A framework and benchmark for LLM-driven multiagent control. IEEE Transactions on Games, 17(4): 933–943, 2025. doi:10.1109/TG.2025.3564042

  8. [8]

    Vending-Bench: A benchmark for long-term coherence of autonomous agents

    Axel Backlund and Lukas Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents, 2025. URL https://arxiv.org/abs/2502.15840

  9. [9]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4): 324–345, 1952. doi:10.1093/biomet/39.3-4.324. URL https://doi.org/10.1093/biomet/39.3-4.324

  10. [10]

    Vox deorum: A hybrid llm architecture for 4x / grand strategy game ai -- lessons from civilization v, 2025 a

    John Chen, Sihan Cheng, Can Gurkan, Ryan Lay, and Moez Salahuddin. Vox deorum: A hybrid llm architecture for 4x / grand strategy game ai -- lessons from civilization v, 2025 a . URL https://arxiv.org/abs/2512.18564

  11. [11]

    LLMArena: Assessing capabilities of large language models in dynamic multi-agent environments

    Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, and Lijie Wen. LLMArena: Assessing capabilities of large language models in dynamic multi-agent environments. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp...

  12. [12]

    Benchmarking large language models under data contamination: A survey from static to dynamic evaluation

    Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conf...

  13. [13]

    XGBoost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp.\ 785--794, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi:10.1145/2939672.2939785. URL https://doi.org/10.1145/2939672.2939785

  14. [14]

    Gamebench: Evaluating strategic reasoning abilities of llm agents, 2024

    Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents, 2024. URL https://arxiv.org/abs/2406.06613

  15. [15]

    Investigating data contamination in modern benchmarks for large language models

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volu...

  16. [16]

    Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624): 1067–1074, 2022

    Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiya...

  17. [17]

    AgentQuest: A modular benchmark framework to measure progress and improve LLM agents

    Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. AgentQuest: A modular benchmark framework to measure progress and improve LLM agents. In Kai-Wei Chang, Annie Lee, and Nazneen Rajani (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association fo...

  18. [18]

    Multilayer feedforward networks are universal approximators

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359–366, 1989. ISSN 0893-6080. doi:10.1016/0893-6080(89)90020-8. URL https://www.sciencedirect.com/science/article/pii/0893608089900208

  19. [19]

    Ecogym: Evaluating llms for long-horizon plan-and-execute in interactive economies, 2026

    Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, JinCheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, and Wangchunshu Zhou. Ecogym: Evaluating llms for long-horizon plan-and-execute in interactive economies, 2026. URL https://arxiv.org/abs/2602.09514

  20. [20]

    Openskill: A faster asymmetric multi-team, multiplayer rating system

    Vivek Joshy. Openskill: A faster asymmetric multi-team, multiplayer rating system. Journal of Open Source Software, 9(93): 5901, 2024. doi:10.21105/joss.05901. URL https://doi.org/10.21105/joss.05901

  21. [21]

    Dynabench: Rethinking benchmarking in NLP

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP . In Kristina Toutanova, ...

  22. [22]

    Liveagentbench: Comprehensive benchmarking of agentic systems across 104 real-world challenges, 2026

    Hao Li, Huan Wang, Jinjie Gu, Wenjie Wang, Chenyi Zhuang, and Sikang Bian. Liveagentbench: Comprehensive benchmarking of agentic systems across 104 real-world challenges, 2026. URL https://arxiv.org/abs/2603.02586

  23. [23]

    An open-source data contamination report for large language models

    Yucheng Li, Yunhao Guo, Frank Guerin, and Chenghua Lin. An open-source data contamination report for large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 528--541, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10....

  24. [24]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  25. [25]

    AgentBench: Evaluating LLMs as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning...

  26. [26]

    AgentBoard: An analytical evaluation board of multi-turn LLM agents

    Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp.\ 74325--743...

  27. [27]

    Large language models play StarCraft II: benchmarks and a chain of summarization approach

    Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Yuqiao Wu, Runji Lin, Haifeng Zhang, and Jun Wang. Large language models play StarCraft II: benchmarks and a chain of summarization approach. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 133386--133...

  28. [28]

    Evaluation and benchmarking of LLM agents: A survey

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD '25, pp.\ 6129--6139, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400714542. doi:10.1145/3711896.3736570. URL https://doi.org/...

  29. [29]

    OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sut...

  30. [30]

    BALROG: Benchmarking agentic LLM and VLM reasoning on games

    Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. BALROG: Benchmarking agentic LLM and VLM reasoning on games. In The Thirteenth International Conference on Learning Represe...

  31. [31]

    Scaling strategy, not compute: A stand-alone, open-source starcraft ii benchmark for accessible reinforcement learning research, 2026

    Sourav Panda, Shreyash Kale, Tanmay Ambadkar, Abhinav Verma, and Jonathan Dodge. Scaling strategy, not compute: A stand-alone, open-source starcraft ii benchmark for accessible reinforcement learning research, 2026. URL https://arxiv.org/abs/2603.06608

  32. [32]

    Civrealm: A learning and reasoning odyssey in civilization for decision-making agents

    Siyuan Qi, Shuo Chen, Yexin Li, Xiangyu Kong, Junqi Wang, Bangcheng Yang, Pring Wong, Yifan Zhong, Xiaoyuan Zhang, Zhaowei Zhang, Nian Liu, Yaodong Yang, and Song-Chun Zhu. Civrealm: A learning and reasoning odyssey in civilization for decision-making agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview...

  33. [33]

    A bayesian approach to in-game win probability in soccer

    Pieter Robberechts, Jan Van Haaren, and Jesse Davis. A bayesian approach to in-game win probability in soccer. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '21, pp.\ 3512--3521, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383325. doi:10.1145/3447548.3467194. URL https://doi.org/10...

  34. [34]

    Workbench: a benchmark dataset for agents in a realistic workplace setting

    Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, and Bertie Vidgen. Workbench: a benchmark dataset for agents in a realistic workplace setting. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=4HNAwZFDcH

  35. [35]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:...

  36. [36]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning (Nature)

    Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibar...

  37. [37]

    TMGBench: A systematic game benchmark for evaluating strategic reasoning abilities of LLMs

    Haochuan Wang, Xiachong Feng, Lei Li, Yu Guo, Zhanyue Qin, Dianbo Sui, and Lingpeng Kong. Tmgbench: A systematic game benchmark for evaluating strategic reasoning abilities of llms, 2025 a . URL https://arxiv.org/abs/2410.10479

  38. [38]

    Digital player: Evaluating large language models based human-like agent in games, 2025 b

    Jiawei Wang, Kai Wang, Shaojie Lin, Runze Wu, Bihan Xu, Lingeng Jiang, Shiwei Zhao, Renyu Zhu, Haoyu Liu, Zhipeng Hu, Zhong Fan, Le Li, Tangjie Lyu, and Changjie Fan. Digital player: Evaluating large language models based human-like agent in games, 2025 b . URL https://arxiv.org/abs/2502.20807

  39. [39]

    BattleAgentBench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems

    Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems, 2024. URL https://arxiv.org/abs/2408.15971

  40. [40]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In A. Globerson, L. Mackey, D. Bel...

  41. [41]

    SPIN-Bench: How well do LLMs plan strategically and reason socially?

    Jianzhu Yao, Kevin Wang, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, and Pramod Viswanath. SPIN-Bench: How well do LLMs plan strategically and reason socially? In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=mgsS73kvOA

  42. [42]

    Survey on Evaluation of LLM-based Agents

    Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of llm-based agents, 2025. URL https://arxiv.org/abs/2503.16416

  43. [43]

    Deep sets

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/pap...

  44. [44]

    DeepPlanning: Benchmarking long-horizon agentic planning with verifiable constraints

    Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, and Junyang Lin. Deepplanning: Benchmarking long-horizon agentic planning with verifiable constraints, 2026. URL https://arxiv.org/abs/2601.18137

  45. [45]

    WebArena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024 a . URL https://openreview.net/forum?id=oKn9c6ytLx

  46. [46]

    SOTOPIA: Interactive evaluation for social intelligence in language agents

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. SOTOPIA: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=mM7VurbA4r

  47. [47]

    MultiAgentBench : Evaluating the collaboration and competition of LLM agents

    Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You. MultiAgentBench: Evaluating the collaboration and competition of LLM agents. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of ...
