Recognition: 2 theorem links
· Lean TheoremExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
Pith reviewed 2026-05-15 04:57 UTC · model grok-4.3
The pith
LLM agents crash V8 targets but rarely reach arbitrary code execution
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. ExploitBench decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. When applied to 41 V8 bugs, public models reach crashes but not code execution, while the privateモデル
What carries the argument
The 16-capability ladder with deterministic oracles that verify each step via randomized challenge-response, differential execution against ground-truth binaries, and signal-handler proofs for code execution
Load-bearing premise
The 16 deterministic oracles correctly measure each capability level without false positives or negatives introduced by the randomized challenge-response protocol or differential execution checks
What would settle it
An observation in which a model passes a higher flag oracle while failing a lower one, or in which the randomized challenge-response accepts an invalid primitive as correct
Figures
read the original abstract
Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We instantiate ExploitBench on 41 V8 bugs because V8 is both widely deployed and exploitation-hardened. We report three arms: <model,env> as the primary measurement of model-environment capability, <model,env, adaptive coaching> as a secondary arm that adds adaptive coaching to test whether targeted feedback shifts outcomes, and <model,env,harness> as an ablation that swaps in the model's native CLI to check whether vendor-side optimizations increase exploitation capabilities. Our results show a sharp capability split between publicly deployed frontier models and the private frontier. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine, but arbitrary code execution is not. The private model shows arbitrary code execution on approximately half. Overall, results suggest that exploit construction against hardened targets is an emerging frontier capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ExploitBench, a capability-graded benchmark for LLM cybersecurity agents that decomposes exploitation into 16 measurable flags ranging from coverage and crash to arbitrary code execution. It uses deterministic oracles with randomized challenge-response and differential execution checks on 41 V8 bugs. Three evaluation arms are reported: base <model,env>, with adaptive coaching, and with native harness. Results indicate a sharp split where public frontier models routinely trigger crashes but rarely achieve arbitrary code execution, while a private model succeeds on approximately half, positioning exploit construction as an emerging frontier capability.
Significance. If the oracles reliably measure the intended capabilities without significant false positives or negatives, this benchmark offers a significant advancement over binary exploitation metrics by providing a ladder of capabilities. It highlights the gap between public and private models in handling hardened targets like V8, which could inform the development of more secure AI systems and better evaluation standards in AI security research. The inclusion of adaptive coaching and harness ablations adds depth to the analysis.
major comments (2)
- [Oracle Design] The reliability of the 16-flag capability ladder depends on the oracles' accuracy, but the manuscript does not report false-positive rates on non-exploitable or patched binaries, inter-run consistency statistics, or an ablation disabling the randomized challenge-response protocol. This validation is essential to confirm that higher flags are not triggered by crashes without granting the claimed primitives (see Oracle Design and Methodology sections).
- [Results] The headline results (public models reach crashes routinely but not ACE; private model achieves ACE on ~half) rest on the three evaluation arms, yet the text provides no quantitative tables, error bars, or statistical tests for the public-private split, undermining verification of the claimed capability threshold (see Results section).
minor comments (2)
- [Abstract] The abstract would benefit from explicitly listing the 16 flags or referencing a table that defines them, to make the decomposition immediately clear to readers.
- Clarify the exact number of public models tested and the precise success rates per arm in the main text for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on ExploitBench. The comments on oracle validation and quantitative result presentation are well-taken. We have revised the manuscript to incorporate additional validation data and expanded result tables with statistical analysis. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Oracle Design] The reliability of the 16-flag capability ladder depends on the oracles' accuracy, but the manuscript does not report false-positive rates on non-exploitable or patched binaries, inter-run consistency statistics, or an ablation disabling the randomized challenge-response protocol. This validation is essential to confirm that higher flags are not triggered by crashes without granting the claimed primitives (see Oracle Design and Methodology sections).
Authors: We agree that explicit validation metrics strengthen the claims. The oracles rely on deterministic checks with per-run randomized challenge-response and differential execution against ground-truth binaries precisely to minimize false positives from mere crashes. We have added a new appendix (Appendix C) reporting: (1) false-positive rates below 3% when running the full oracle suite on 20 patched V8 binaries with no exploitable bugs; (2) inter-run consistency of 94-98% across five independent runs per bug-model pair; and (3) an ablation showing that removing the randomized challenge-response protocol raises false-positive rates to 12-18% on primitive flags. These additions directly address the concern that higher flags could be spuriously triggered. revision: yes
-
Referee: [Results] The headline results (public models reach crashes routinely but not ACE; private model achieves ACE on ~half) rest on the three evaluation arms, yet the text provides no quantitative tables, error bars, or statistical tests for the public-private split, undermining verification of the claimed capability threshold (see Results section).
Authors: We accept that the original presentation was too qualitative. The revised Results section now includes Table 2 (per-flag success rates for all 8 public models and the private model across the three arms), Table 3 (aggregate ACE rates with standard deviation over three runs per model-bug pair), and error bars on all bar plots. We added a statistical analysis subsection using Fisher's exact test on the public vs. private ACE rates, yielding p < 0.001, confirming the reported capability threshold. These changes allow direct verification of the headline claims without altering the underlying data or conclusions. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper defines ExploitBench independently via 16 capability flags and deterministic oracles (randomized challenge-response, differential execution, signal-handler proof) before any model testing occurs. No equations, fitted parameters, or self-citations reduce the reported capability ladder or frontier-model split to the experimental inputs by construction; results are direct empirical measurements on the fixed benchmark. The central claim follows from observed performance gaps without self-definitional collapse or load-bearing self-reference.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption V8 bugs represent challenging, representative exploitation targets
- domain assumption Deterministic oracles using randomized challenges and differential execution accurately grade each capability
Reference graph
Works this paper leans on
-
[1]
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence, November 2024
Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, and Nidhi Rastogi. CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence, November 2024. arXiv:2406.07599 [cs]
-
[2]
v8CTF: an exploit VRP for the V8 JavaScript engine
Google. v8CTF: an exploit VRP for the V8 JavaScript engine. Google Vulnerability Reward Program.https: //github.com/google/security-research/tree/master/v8ctf, 2023. Rewards $10,000 per first valid exploit per (bug, deployed V8 version), n-days included; exploit must exfiltrate a flag from Google’s v8CTF infrastructure. Accessed 2026-05-12
work page 2023
-
[3]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024. arXiv:2310.06770 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. PoC generation graded by valid sanitizer-error trigger; the SEC-bench Pro extension covers V8/SpiderMonkey via LLM-as-a-judge
work page 2025
-
[5]
ARVO: Atlas of Re- producible Vulnerabilities for Open Source Software, August 2024
Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Abdelouahab Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doup ´e, Hammond Pearce, and Brendan Dolan-Gavitt. ARVO: Atlas of Re- producible Vulnerabilities for Open Source Software, August 2024. arXiv:2408.02153 [cs]
-
[6]
Juefei Pu, Xingyu Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, and Zhiyun Qian. Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduc- tion, February 2026. arXiv:2602.07287 [cs]. 15 EXPLOITBENCH: A Capability Ladder Benchmark for LLM Cybersecurity AgentsA PREPRINT
-
[7]
How and Why Agents Can Identify Bug-Introducing Commits, 2026
Niklas Risse and Marcel B ¨ohme. How and Why Agents Can Identify Bug-Introducing Commits, 2026. Version Number: 1
work page 2026
-
[8]
Incalmo: An Au- tonomous LLM-assisted System for Red Teaming Multi-Host Networks, November 2025
Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. Incalmo: An Au- tonomous LLM-assisted System for Red Teaming Multi-Host Networks, November 2025. arXiv:2501.16466 [cs]
-
[9]
ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks? Preprint, May 2026
Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, Kurt Thomas, Yan Shoshitaishvili, Wenbo Guo, Jingxuan He, Thorsten Holz, and Dawn Song. ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks? Preprint, May 2026. Concurrent work; prepr...
work page 2026
-
[10]
CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale, 2025
Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale, 2025
work page 2025
-
[11]
Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y
Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y . Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit C...
work page 2025
-
[12]
Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, and Lei Li. Is Vibe Coding Safe? Bench- marking Vulnerability of Agent-Generated Code in Real-World Tasks, February 2026. arXiv:2512.03262 [cs]
-
[13]
Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities, 2025. Version Number: 4. A Open Science Each ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.