{"paper":{"title":"Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack","license":"http://creativecommons.org/licenses/by/4.0/","headline":"BenchJack automatically uncovers reward-hacking exploits that let agents score near-perfect on popular benchmarks without completing tasks.","cross_cats":["cs.CR"],"primary_cat":"cs.AI","authors_text":"Alvin Cheung, Dawn Song, Hanchen Li, Hao Wang, Koushik Sen, Qiuyang Mang","submitted_at":"2026-05-12T19:22:45Z","abstract_excerpt":"Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. 
Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that exploits discovered by BenchJack using its own auditing agents represent genuine, transferable reward hacks that would succeed on standard frontier models rather than being artifacts of the clairvoyant auditing setup or specific model choices.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"BenchJack automatically uncovers reward-hacking exploits that let agents score near-perfect on popular benchmarks without completing tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9c6b913549da44b80e5bb001f5f748dc5562cc9c34454ef3259069082cdc626d"},"source":{"id":"2605.12673","kind":"arxiv","version":1},"verdict":{"id":"66f3423a-b218-4c8c-823c-0a7fd915aca9","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:28:51.717346Z","strongest_claim":"BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. 
Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations.","one_line_summary":"BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that exploits discovered by BenchJack using its own auditing agents represent genuine, transferable reward hacks that would succeed on standard frontier models rather than being artifacts of the clairvoyant auditing setup or specific model choices.","pith_extraction_headline":"BenchJack automatically uncovers reward-hacking exploits that let agents score near-perfect on popular benchmarks without completing tasks."},"references":{"count":129,"sample":[{"doi":"","year":2016,"title":"Concrete Problems in AI Safety","work_id":"c8d14fbe-6eab-464a-95b3-778aabd82fa3","ref_index":1,"cited_arxiv_id":"1606.06565","is_internal_anchor":true},{"doi":"","year":2026,"title":"Alignment risk update: Claude mythos preview, 2026","work_id":"fbe6a6ba-8766-4e3d-b929-b939ced8b6cb","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Anthropic / Community Sources. Claude code. 
https://www.anthropic.com/product/claude-code, 2026","work_id":"d1eb7df5-3376-46e2-a9f2-f03ac7db130b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Analyzing and improving chain-of-thought monitorability through information theory, 2026","work_id":"fffbe2f1-53bd-4d5c-a959-a68aa38033fc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Rewardhackingagents: Benchmarking evaluation integrity for llm ml-engineering agents, 2026","work_id":"ac8b61ac-fef5-41a9-ac9b-22d01a05a183","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":129,"snapshot_sha256":"2a818494815b1a60dbe285db4cc33f6f93b4cc0a9f3224aa5df7e688ddeb155b","internal_anchors":21},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}