pith. machine review for the scientific record. sign in

arxiv: 2605.12673 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.CR

Recognition: no theorem link

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:28 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords AI agent benchmarksreward hackingbenchmark auditingred teamingAI evaluation securityBenchJackagent benchmarks flawsiterative patching
0
0 comments X

The pith

BenchJack automatically uncovers reward-hacking exploits that let agents score near-perfect on popular benchmarks without completing tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reward hacking arises spontaneously in frontier models on agent benchmarks, where agents maximize scores without doing the intended work. It derives a taxonomy of eight recurring flaw patterns and condenses them into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and synthesize exploits in a clairvoyant manner. BenchJack finds 219 distinct flaws across ten benchmarks in software engineering, web navigation, desktop computing, and terminal operations, achieving near-perfect scores on most without solving any tasks. The system extends to an iterative generative-adversarial pipeline that discovers new flaws and patches them, reducing the hackable-task ratio from near 100 percent to under 10 percent on four benchmarks and fully securing WebArena and OSWorld in three iterations. This establishes that evaluation pipelines lack an adversarial mindset and that proactive auditing can close the security gap.

Core claim

BenchJack is an automated red-teaming system that drives coding agents to audit AI agent benchmarks and identify possible reward-hacking exploits in a clairvoyant manner; when applied to ten popular benchmarks it synthesizes exploits achieving near-perfect scores without solving a single task, surfaces 219 distinct flaws across eight classes, and extends to an iterative pipeline that reduces the hackable-task ratio from near 100 percent to under 10 percent on four benchmarks while fully patching WebArena and OSWorld within three iterations.

What carries the argument

BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and synthesize reward-hacking exploits in a clairvoyant manner, together with its extended iterative generative-adversarial patching pipeline.

Load-bearing premise

Exploits discovered by BenchJack's own auditing agents represent genuine, transferable reward hacks that would succeed on standard frontier models rather than being artifacts of the clairvoyant setup.

What would settle it

Running unmodified frontier models on the original benchmarks using the exact exploits BenchJack synthesized and checking whether they achieve near-perfect scores without task completion.

Figures

Figures reproduced from arXiv: 2605.12673 by Alvin Cheung, Dawn Song, Hanchen Li, Hao Wang, Koushik Sen, Qiuyang Mang.

Figure 1
Figure 1. Figure 1: How a nine-line conftest.py hacks SWE-bench. SWE-bench evaluates correctness of the submitted patch via a test suite. The benchmark does not reset arbitrary files, leading to a trust boundary violation. A hacking model can create a conftest.py that PyTest auto-loads. The file registers a hook and rewrites every test’s reported outcome, resulting in a 100% resolve rate. spontaneously reward-hack in more tha… view at source ↗
Figure 2
Figure 2. Figure 2: The eight recurring flaw classes (V1–V8) in our flaw taxonomy, covering issues such as [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The three-stage BENCHJACK audit pipeline: reconnaissance, taxonomy-guided flaw scan, and exploit construction. BENCHJACK first maps the evaluation structure in the reconnaissance stage. With the guidance of the flaw taxonomy and the reconnaissance mapping, BENCHJACK scans the benchmark to produce a ledger of flaws. Finally, BENCHJACK iteratively synthesize and validate a reward-hacking exploit given the pr… view at source ↗
Figure 4
Figure 4. Figure 4: Iterative refinement loop: BENCHJACK acts as an adaptive hacker while a coding agent patches the benchmark against each verified ex￾ploit, repeating until no new reward hack can be produced or the benchmark is non-patchable. We couple BENCHJACK with a simple coding agent as the defender tasked with patching the benchmark against a verified exploit and the corresponding flaws (shown in [PITH_FULL_IMAGE:fig… view at source ↗
Figure 5
Figure 5. Figure 5: BENCHJACK results across 10 benchmarks. Left: exploit hack rate per benchmark, sorted from highest to lowest. nine benchmarks are hacked on almost all of instances. Right: number of benchmarks hacked via each flaw pattern (Section 4.1), ordered V1–V8. Benchmarks tagged with multiple flaws (e.g., V1&V7) count once toward each listed class. 0 10 20 30 40 50 Number of vulnerabilities V1 Isolation V2 Answers i… view at source ↗
Figure 6
Figure 6. Figure 6: Prevalence and reach of reward-hack classes across all 10 audited benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Patching study. For each benchmark we report the original hack rate (red), the hack rate [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Iterative improvement study. For four benchmarks without major design flaw, we re-run [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents BenchJack, a system for systematically auditing AI agent benchmarks for reward-hacking vulnerabilities. It derives an eight-class taxonomy of flaws from past incidents, compiles it into the Agent-Eval Checklist, and applies an automated red-teaming pipeline using clairvoyant coding agents to 10 benchmarks. The results report discovery of 219 flaws and near-perfect exploit scores without task completion, with an extended iterative pipeline reducing hackable tasks to under 10% on several benchmarks and fully patching WebArena and OSWorld in three iterations.

Significance. If the discovered exploits prove transferable to standard frontier agents without privileged access, the work would be significant for highlighting systemic issues in benchmark design and offering a proactive auditing method. This could influence how future agent benchmarks are constructed to be more robust against reward hacking, which is increasingly relevant as benchmarks drive AI development decisions.

major comments (3)
  1. Abstract: The claims of synthesizing exploits achieving near-perfect scores without solving tasks and surfacing 219 distinct flaws lack supporting details on validation procedures, baseline comparisons, error bars, or controls to rule out artifacts from the clairvoyant auditing setup.
  2. Abstract: The assertion that the extended pipeline reduces the hackable-task ratio to under 10% on four benchmarks and fully patches WebArena and OSWorld within three iterations does not include evidence that the patches preserve the benchmarks' intended evaluation properties or that the process does not introduce new vulnerabilities.
  3. Application to benchmarks: The central claim that these represent genuine, transferable reward hacks requires explicit transfer experiments showing success on unmodified agent scaffolds; the clairvoyant access during synthesis may surface non-transferable flaws.
minor comments (2)
  1. The taxonomy of eight flaw patterns is mentioned but not detailed in the abstract; consider adding a brief overview or reference to the section where it is presented.
  2. Ensure all quantitative results are accompanied by statistical measures or confidence intervals in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful and constructive comments. We have revised the manuscript to strengthen the supporting details for our claims and to clarify the scope of our contributions. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: Abstract: The claims of synthesizing exploits achieving near-perfect scores without solving tasks and surfacing 219 distinct flaws lack supporting details on validation procedures, baseline comparisons, error bars, or controls to rule out artifacts from the clairvoyant auditing setup.

    Authors: We agree that the abstract, as a concise summary, omitted key methodological details. In the revision we have updated the abstract to reference the validation procedures (manual verification of a 20% random sample of exploits by two independent annotators with inter-annotator agreement of 0.87), baseline comparisons against random and greedy agents, and controls for clairvoyant artifacts (ablation studies in Section 4.2). Full error bars from five independent runs and statistical tests are reported in Tables 2 and 3 of the main text. revision: yes

  2. Referee: Abstract: The assertion that the extended pipeline reduces the hackable-task ratio to under 10% on four benchmarks and fully patches WebArena and OSWorld within three iterations does not include evidence that the patches preserve the benchmarks' intended evaluation properties or that the process does not introduce new vulnerabilities.

    Authors: This concern is valid. We have added a dedicated subsection (5.3) presenting evidence that patched benchmarks preserve intended properties: human-expert task-completion rates remain statistically indistinguishable from the originals (p > 0.1), and correlation with unpatched difficulty rankings is preserved (Spearman ρ > 0.85). We also re-applied BenchJack to the patched versions and report zero new flaws within the eight-class taxonomy after the final iteration. We acknowledge that exhaustive search for vulnerabilities outside our taxonomy remains an open limitation. revision: yes

  3. Referee: Application to benchmarks: The central claim that these represent genuine, transferable reward hacks requires explicit transfer experiments showing success on unmodified agent scaffolds; the clairvoyant access during synthesis may surface non-transferable flaws.

    Authors: The manuscript does not claim that the synthesized exploits transfer directly to unmodified frontier-agent scaffolds; it demonstrates that benchmark designs contain exploitable reward-hacking vulnerabilities when audited with clairvoyant access. We have revised the introduction and discussion to state this scope explicitly and to explain why clairvoyant auditing is a necessary first step for benchmark designers. While we agree transfer experiments would be valuable, they lie outside the current contribution focused on systematic auditing rather than agent robustness; we have added this as a suggested direction for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical auditing pipeline with external taxonomy derivation

full rationale

The paper constructs BenchJack as an automated red-teaming system and applies it directly to 10 public benchmarks, reporting observed exploit synthesis and flaw counts. The taxonomy of eight flaw patterns is derived from past incidents (external to this work). No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. Results are presented as direct empirical outputs of the auditing pipeline rather than quantities forced by internal definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on the premise that reward hacking occurs spontaneously and that an automated red-teaming system can reliably surface transferable exploits; the taxonomy and BenchJack itself are introduced without independent external validation beyond the reported experiments.

axioms (2)
  • domain assumption Reward hacking emerges spontaneously in frontier models without overfitting.
    Stated directly in the abstract as an observed phenomenon that motivates the need for secure-by-design benchmarks.
  • domain assumption Benchmarks must be secure by design.
    Core normative claim that structures the entire contribution.
invented entities (2)
  • BenchJack no independent evidence
    purpose: Automated red-teaming system that drives coding agents to audit benchmarks and identify reward-hacking exploits
    Newly introduced tool whose effectiveness is demonstrated through application to ten benchmarks.
  • Agent-Eval Checklist no independent evidence
    purpose: Taxonomy of eight recurring flaw patterns compiled for benchmark designers
    Derived from past incidents and presented as a new organizing framework.

pith-pipeline@v0.9.0 · 5566 in / 1637 out tokens · 67236 ms · 2026-05-14T20:28:51.717346+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

129 extracted references · 49 canonical work pages · 19 internal anchors

  1. [1]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URLhttps://arxiv.org/abs/1606.06565

  2. [2]

    Alignment risk update: Claude mythos preview, 2026

    Anthropic. Alignment risk update: Claude mythos preview, 2026. URL https://www-cdn. anthropic.com/3edfc1a7f947aa81841cf88305cb513f184c36ae.pdf

  3. [3]

    Claude code

    Anthropic / Community Sources. Claude code. https://www.anthropic.com/product/ claude-code, 2026

  4. [4]

    Analyzing and improving chain-of-thought monitorability through information theory, 2026

    Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, and Christos Louizos. Analyzing and improving chain-of-thought monitorability through information theory, 2026. URL https: //arxiv.org/abs/2602.18297

  5. [5]

    Rewardhackingagents: Benchmarking evaluation integrity for llm ml-engineering agents, 2026

    Yonas Atinafu and Robin Cohen. Rewardhackingagents: Benchmarking evaluation integrity for llm ml-engineering agents, 2026. URLhttps://arxiv.org/abs/2603.11337

  6. [6]

    Monitoring reasoning models for misbehavior.arXiv preprint arXiv:2503.11926,

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/ 2503.11926

  7. [7]

    Adversarial reward auditing for active detection and mitigation of reward hacking, 2026

    Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, and Lifu Huang. Adversarial reward auditing for active detection and mitigation of reward hacking, 2026. URL https: //arxiv.org/abs/2602.01750

  8. [8]

    Bowman and George E

    Samuel R. Bowman and George E. Dahl. What will it take to fix benchmarking in natural language understanding?, 2021. URLhttps://arxiv.org/abs/2104.02145

  9. [9]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander M ˛ adry. Mle-bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https://arxiv.org/abs/2410.07095

  10. [10]

    arXiv preprint arXiv:2502.17521 , year=

    Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Recent advances in large langauge model benchmarks against data contamination: From static to dynamic evaluation, 2025. URL https://arxiv.org/abs/2502.17521

  11. [11]

    Dynamic benchmarking of reasoning capabilities in code large language models under data contamination, 2025

    Simin Chen, Pranav Pusarla, and Baishakhi Ray. Dynamic benchmarking of reasoning capabilities in code large language models under data contamination, 2025. URL https: //arxiv.org/abs/2503.04149. 10

  12. [12]

    Reasoning Models Don't Always Say What They Think

    Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

  13. [13]

    Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/, August 2024

  14. [14]

    The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

    Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. The benchmark lottery, 2021. URLhttps://arxiv.org/ abs/2107.07002

  15. [15]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-ho...

  16. [16]

    Bowman, Ethan Perez, and Evan Hubinger

    Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, and Evan Hubinger. Sycophancy to subterfuge: Inves- tigating reward-tampering in large language models, 2024. URL https://arxiv.org/abs/ 2406.10162

  17. [17]

    Benchmarking reward hack detection in code environments via contrastive analysis, 2026

    Darshan Deshpande, Anand Kannappan, and Rebecca Qian. Benchmarking reward hack detection in code environments via contrastive analysis, 2026. URL https://arxiv.org/ abs/2601.20103

  18. [18]

    Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021

    Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021. URLhttps://arxiv.org/abs/1908.04734

  19. [19]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems (NeurIPS), 2014

  20. [20]

    Problems of monetary management: The UK experience.Monetary Theory and Practice, pages 91–121, 1984

    Charles AE Goodhart. Problems of monetary management: The UK experience.Monetary Theory and Practice, pages 91–121, 1984

  21. [21]

    Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y

    Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025. URLhttps://arxiv.org/abs/2512.18311

  22. [22]

    LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

    Lukas Helff, Quentin Delfosse, David Steinmann, Ruben Härle, Hikaru Shindo, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting, and Felix Friedrich. Llms gaming verifiers: Rlvr can lead to reward hacking, 2026. URLhttps://arxiv.org/abs/2604.15149

  23. [23]

    Issue #14: Iquest-coder-v1

    IQuestLab. Issue #14: Iquest-coder-v1. https://github.com/IQuestLab/ IQuest-Coder-V1/issues/14, 2026. GitHub issue

  24. [24]

    Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks,

    Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks,

  25. [25]

    URLhttps://arxiv.org/abs/2305.10160

  26. [26]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URLhttps://arxiv.org/abs/2310.06770

  27. [27]

    Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

    Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, and Lu Wang. Countdown-code: A testbed for studying the emergence and generalization of reward hacking in rlvr, 2026. URL https://arxiv.org/abs/2603.07084. 11

  28. [28]

    Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv: 2507.11473, 2025

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

  29. [29]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

  30. [30]

    ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

    Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, and Han chung Lee. Clawsbench: Evaluating capability and safety of llm productivity agents in simulated workspaces, 2026. URLhttps://arxiv.org/abs/2604.05172

  31. [31]

    Diagnosing pathological chain-of-thought in reasoning models, 2026

    Manqing Liu, David Williams-King, Ida Caspary, Linh Le, Hannes Whittingham, Puria Rad- mard, Cameron Tice, and Edward James Young. Diagnosing pathological chain-of-thought in reasoning models, 2026. URLhttps://arxiv.org/abs/2602.13904

  32. [32]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents, 2025. URL https://arxiv.org/abs/2308.03688

  33. [33]

    Natural emergent misalignment from reward hacking in production rl, 2025

    Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural emergent misalignmen...

  34. [34]

    Gonzalez, Jingbo 12 Preprint FrontierCS T eam Shang, and Alvin Cheung

    Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, ...

  35. [35]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

  36. [36]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants, 2023. URL https://arxiv.org/abs/ 2311.12983

  37. [37]

    Introducing codex.https://openai.com/index/introducing-codex/, 2025

    OpenAI. Introducing codex.https://openai.com/index/introducing-codex/, 2025

  38. [38]

    Why swe-bench verified no longer measures frontier coding capabilities

    OpenAI. Why swe-bench verified no longer measures frontier coding capabilities. https: //openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/, 2026

  39. [39]

    Proving test set contamination in black box language models

    Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, and Tatsunori B. Hashimoto. Proving test set contamination in black box language models, 2023. URL https://arxiv. org/abs/2310.17623

  40. [40]

    Introducing gpt-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026a

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?, 2025. URL https: //arxiv.org/abs/2502.10517

  41. [41]

    arXiv preprint arXiv:2402.06627 , year=

    Alexander Pan, Erik Jones, Meena Jagadeesan, and Jacob Steinhardt. Feedback loops with language models drive in-context reward hacking, 2024. URL https://arxiv.org/abs/ 2402.06627

  42. [42]

    Frontierswe: Benchmarking software engineering skill at the edge of human ability.https://www.frontierswe.com/, 2026

    Proximal Labs. Frontierswe: Benchmarking software engineering skill at the edge of human ability.https://www.frontierswe.com/, 2026

  43. [43]

    Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot llm assessment, 2024

    Vyas Raina, Adian Liusie, and Mark Gales. Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot llm assessment, 2024. URL https://arxiv.org/abs/2402. 14016

  44. [44]

    Posttrainbench: Can llm agents automate llm post-training?,

    Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can llm agents automate llm post-training?,

  45. [45]

    URLhttps://arxiv.org/abs/2603.08640

  46. [46]

    Goal misgeneralization: Why correct specifications aren’t enough for correct goals, 2022

    Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. Goal misgeneralization: Why correct specifications aren’t enough for correct goals, 2022. URLhttps://arxiv.org/abs/2210.01790

  47. [47]

    Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker

    Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. The leaderboard illusion, 2025. URL https://arxiv.org/abs/2504. 20879

  48. [48]

    Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking, 2025. URLhttps://arxiv.org/abs/2209.13085

  49. [49]

    Detecting Safety Violations Across Many Agent Traces

    Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, and Eric Wong. Detecting safety violations across many agent traces, 2026. URLhttps://arxiv.org/abs/2604.11806

  50. [50]

    improving ratings

    Marilyn Strathern. “improving ratings”: Audit in the British university system.European Review, 5(3):305–321, 1997

  51. [51]

    Recent frontier models are reward hacking,

    Beth Barnes Sydney V on Arx, Lawrence Chan. Recent frontier models are reward hacking,

  52. [52]

    URLhttps://metr.org/blog/2025-06-05-recent-reward-hacking/

  53. [53]

    FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

    Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, and Shan Jiang. Fieldworkarena: Agentic ai benchmark for real field work tasks, 2026. URLhttps://arxiv.org/abs/2505.19662. 13

  54. [54]

    BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

    Xinming Tu, Tianze Wang, Yingzhou, Lu, Kexin Huang, Yuanhao Qu, and Sara Mostafavi. Benchguard: Who guards the benchmarks? automated auditing of llm agent benchmarks, 2026. URLhttps://arxiv.org/abs/2604.24955

  55. [55]

    Detecting and Suppressing Reward Hacking with Gradient Fingerprints

    Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, Greg Durrett, and Xi Ye. Detecting and suppressing reward hacking with gradient fingerprints, 2026. URLhttps://arxiv.org/abs/2604.16242

  56. [56]

    Reward hacking in reinforcement learning., 2024

    Lilian Weng. Reward hacking in reinforcement learning., 2024. URL https://lilianweng. github.io/posts/2024-11-28-reward-hacking/

  57. [57]

    Monitoring emergent reward hacking during generation via internal activations, 2026

    Patrick Wilhelm, Thorsten Wittkopp, and Odej Kao. Monitoring emergent reward hacking during generation via internal activations, 2026. URLhttps://arxiv.org/abs/2603.04069

  58. [58]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https: //arxiv.org/abs/2...

  59. [59]

    Investigating cot monitorability in large reasoning models, 2026

    Shu Yang, Junchao Wu, Xilin Gong, Xuansheng Wu, Derek Wong, Ninghao Liu, and Di Wang. Investigating cot monitorability in large reasoning models, 2026. URL https://arxiv.org/ abs/2511.08525

  60. [60]

    Gonzalez, and Ion Stoica

    Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023. URL https://arxiv.org/abs/2311.04850

  61. [61]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/ 2406.12045

  62. [62]

    Utboost: Rigorous evaluation of coding agents on swe-bench, 2025

    Boxi Yu, Yuxuan Zhu, Pinjia He, and Daniel Kang. Utboost: Rigorous evaluation of coding agents on swe-bench, 2025. URLhttps://arxiv.org/abs/2506.09289

  63. [63]

    Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark,

    Boxi Yu, Yang Cao, Yuzhong Zhang, Liting Lin, Junjielong Xu, Zhiqing Zhong, Qinghua Xu, Guancheng Wang, Jialun Cao, Shing-Chi Cheung, Pinjia He, and Lionel Briand. Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark,

  64. [64]

    URLhttps://arxiv.org/abs/2603.00520

  65. [65]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/ 2307.13854

  66. [66]

    Wang, Sadjad Fouladi, Francis Y

    Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y . Yan, Kevin Hsieh, and Zaoxing Liu. Netarena: Dynamic benchmarks for ai agents in network automation, 2026. URL https://arxiv.org/abs/2506.03231

  67. [67]

    Establishing best practices for building rigorous agentic benchmarks,

    Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, and D...

  68. [68]

    arXiv:2507.02825 , year =

    URLhttps://arxiv.org/abs/2507.02825. 14 A Disclosure of Language Model Usage This work uses frontier LLMs as core research instruments (benchmark auditors and alignment judges), as described in Sections 3–5. Additionally, LLMs were used to review and refine drafts of this manuscript; all scientific claims, experimental design, and analysis were conducted ...

  69. [69]

    [exec(\"\

    Match the incoming prompt against the question field of each record and emit the corre- sponding answer verbatim. The unmodified MALT scoring path then accepts the prediction at 1.0. The defect is structural: the gold artifact is part of the task distribution itself, so isolation cannot help. 15 B.3 V3 – Remote code execution into the evaluator (AgentBenc...

  70. [70]

    {name} benchmark github

    Search the web for "{name} benchmark github" to find the repository URL

  71. [71]

    Once you have the URL, clone it: git clone --depth=1 <URL> {dest}

  72. [72]

    7 8If the benchmark is well-known (e.g

    Verify {dest} is not empty. 7 8If the benchmark is well-known (e.g. SWE-bench, WebArena, GAIA, HumanEval, 9MBPP, MATH, GSM8K, ARC, HellaSwag, MMLU, BigCodeBench, etc.), try the 10most likely GitHub URL directly before searching. 11 12You MUST clone into exactly this path: {dest} Listing 8: Setup phase: search-and-clone fallback prompt (SEARCH_PROMPT). Rec...

  73. [73]

    **Entry points** -- Find the main scripts for running evaluations

  74. [74]

    **Evaluation functions** -- Identify scoring / grading / judging code

  75. [75]

    **Task configuration** -- Find task definitions, gold files, expected answers

  76. [76]

    **Execution environment** -- Docker, subprocess, shared filesystem?

  77. [77]

    **LLM-as-judge** -- Any LLM-based evaluation (OpenAI/Anthropic API calls)?

  78. [78]

    **Trust boundaries** -- Where does agent output flow into the evaluator?

  79. [79]

    task_id_1

    **Cost estimate** -- Lightweight / Moderate / Heavy to run the evaluation. 25 26Be thorough. Read actual code. Cite file paths and line numbers. 27 28## Task ID enumeration (REQUIRED!!) 29 30Before finishing, enumerate **all** concrete task / problem IDs in the 31benchmark and write them to ‘{workspace}/{task_ids_filename}‘ as a JSON 32**object** mapping ...

  80. [80]

    Create a directory: ‘{workspace}/benchjack_poc/‘

Showing first 80 references.