pith. machine review for the scientific record.

arxiv: 2605.14153 · v1 · submitted 2026-05-13 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents


Pith reviewed 2026-05-15 04:57 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM cybersecurity agents · exploitation benchmark · capability ladder · V8 bugs · arbitrary code execution · exploit construction · deterministic oracles

The pith

LLM agents crash V8 targets but rarely reach arbitrary code execution

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that exploitation success is a ladder of progressive capabilities rather than a single binary crash outcome, and that current benchmarks collapse the hard transition from triggering a bug to building control-flow hijacks and code execution. ExploitBench decomposes the process into 16 measurable flags verified by deterministic oracles that use randomized challenge-response, differential execution checks, and signal-handler proofs. When run on 41 V8 bugs, publicly deployed models routinely reach crashes but almost never achieve arbitrary code execution, while a private frontier model succeeds on roughly half the targets. The three evaluation arms test base capability, the effect of adaptive coaching, and the impact of native harnesses. This setup isolates exploit construction against hardened targets as an emerging capability.
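
The ladder framing is easy to make concrete. Below is a minimal Python sketch of graded scoring over the six tiers this review names (coverage, crash, sandbox primitive, arbitrary read/write, control-flow hijack, arbitrary code execution); the paper's other ten flags are not listed here, and none of this is the benchmark's actual code.

    from enum import IntEnum

    class Flag(IntEnum):
        """Six of ExploitBench's 16 flags -- only the tiers named in this review."""
        COVERAGE = 1           # the vulnerable line of code was executed
        CRASH = 2              # the target process crashed
        SANDBOX_PRIMITIVE = 3  # a corruption primitive inside the sandbox
        ARBITRARY_RW = 4       # arbitrary read/write
        CF_HIJACK = 5          # control-flow hijack
        ACE = 6                # arbitrary code execution

    def grade(flags_passed):
        """Highest rung reached, treating the ladder as monotone: a higher
        flag only counts if every lower flag was also verified."""
        best = None
        for flag in sorted(Flag):
            if flag not in flags_passed:
                break  # first missing rung ends the climb
            best = flag
        return best

    # A run that crashes the target but builds no primitive grades as CRASH.
    print(grade({Flag.COVERAGE, Flag.CRASH}).name)  # CRASH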

Core claim

Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. ExploitBench decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. When applied to 41 V8 bugs, public models reach crashes but not code execution, while the private frontier model achieves arbitrary code execution on roughly half the targets.
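
The challenge-response idea can be sketched in a few lines. Everything here is a hypothetical stand-in for the benchmark's harness (ToyTarget and plant_token are invented names); the point is only that a token freshly randomized each run defeats hard-coded or replayed answers, so a primitive flag passes only if the exploit's read actually works.

    import secrets

    class ToyTarget:
        """Hypothetical stand-in for the instrumented target: a dict as memory."""
        def __init__(self):
            self.memory = {}

        def plant_token(self, token: str) -> int:
            addr = 0x41414000  # illustrative address; fresh token each run
            self.memory[addr] = token
            return addr

    def check_arbitrary_read(target, read_primitive) -> bool:
        token = secrets.token_hex(16)     # per-run randomized challenge, so a
        addr = target.plant_token(token)  # hard-coded or replayed answer fails
        return read_primitive(target, addr) == token

    # A working primitive passes; a constant guess cannot.
    print(check_arbitrary_read(ToyTarget(), lambda t, a: t.memory[a]))  # True
    print(check_arbitrary_read(ToyTarget(), lambda t, a: "cafebabe"))   # False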

What carries the argument

The 16-capability ladder with deterministic oracles that verify each step via randomized challenge-response, differential execution against ground-truth binaries, and signal-handler proofs for code execution
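
A hedged sketch of the differential-execution check, assuming the usual construction: a genuine trigger should crash the ground-truth vulnerable build while leaving the patched build unaffected. The binary paths are placeholders, and the divergence criterion is an assumption, not the paper's stated rule.

    import subprocess

    def triggers_bug(poc_path: str,
                     vuln_bin: str = "./d8-vulnerable",  # placeholder paths
                     patched_bin: str = "./d8-patched",
                     timeout: int = 30) -> bool:
        """Differential check: a genuine trigger diverges across the two
        ground-truth builds -- abnormal (signal) exit on the vulnerable
        binary, clean exit on the patched one. Crashing both builds proves
        nothing about the bug."""
        def exit_code(binary: str) -> int:
            return subprocess.run([binary, poc_path], capture_output=True,
                                  timeout=timeout).returncode
        return exit_code(vuln_bin) < 0 and exit_code(patched_bin) == 0

    # Usage: triggers_bug("poc.js") once the two ground-truth builds exist.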

Load-bearing premise

The 16 deterministic oracles correctly measure each capability level without false positives or negatives introduced by the randomized challenge-response protocol or differential execution checks
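
The top rung shows why the premise is at least testable: a signal-handler proof demands that the agent's code run, not merely that the target crash. A much-simplified, POSIX-only sketch follows; the paper's exact protocol is not specified in this review, so the nonce-recording handler below is an assumption.

    import os
    import signal
    import secrets

    def prove_code_execution(payload) -> bool:
        """Grant the top flag only if `payload` executes far enough to raise
        the agreed signal, whose harness-installed handler records a fresh
        per-run nonce."""
        nonce, seen = secrets.token_hex(8), []
        handler = lambda signum, frame: seen.append(nonce)
        previous = signal.signal(signal.SIGUSR1, handler)
        try:
            payload()  # stand-in for the agent's injected code
        finally:
            signal.signal(signal.SIGUSR1, previous)
        return seen == [nonce]

    # Executing code (raising the signal) passes; doing nothing does not.
    print(prove_code_execution(lambda: os.kill(os.getpid(), signal.SIGUSR1)))  # True
    print(prove_code_execution(lambda: None))                                  # False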

What would settle it

An observation in which a model passes a higher flag oracle while failing a lower one, or in which the randomized challenge-response accepts an invalid primitive as correct
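
Both failure modes are mechanically auditable. A sketch of the monotonicity check, with the 16 flags abstracted as integers 1..16:

    def ladder_violations(flags_passed, ladder=range(1, 17)):
        """Flags verified above the first missing rung. A non-empty result
        means either a higher oracle false-positived or a lower one
        false-negatived."""
        missing = [f for f in ladder if f not in flags_passed]
        if not missing:
            return []
        return sorted(f for f in flags_passed if f > missing[0])

    # A run credited with flag 16 (ACE) but not flag 2 (crash) is suspect:
    print(ladder_violations({1, 16}))  # [16]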

Figures

Figures reproduced from arXiv: 2605.14153 by David Brumley, Seunghyun Lee.

Figure 1. Best-of-three union per (model, bug). Columns are vendor-grouped (separators). The Mythos Preview …
Figure 2. Capability unlock over turns per (bug, seed) on the 12 bugs with the most Tier-3+ activity. Traces from …
Figure 3. Conditional probability of advancing one capability step, primary arm. Bars cluster into three groups: agents …
Figure 4. Mean capability count over turns, separated by V8 subsystem (shaded …
Figure 5. Bugs reaching each tier (best-of-three union), per model and measurement arm. Coaching is mixed: it helps …
Figure 6. Per-(model, V8 bug, arm) cost vs. mean capability score across three seeds. X-axis is log-scaled per-episode …
Figure 7. Primitive-by-primitive trajectory for ⟨Mythos Preview, V8⟩ on v8-cve-2024-2887 (a WebAssembly type-confusion bug), seed 3, the cheapest of the three Mythos seeds that reached ace on this bug (94 turns total, $42.72). The horizontal rule separates flags reached by other models in the panel (above) from flags reached by Mythos Preview only in the primary arm (below). Under the vendor-CLI arm ⟨GPT-5.5, V8, CL…
Original abstract

Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We instantiate ExploitBench on 41 V8 bugs because V8 is both widely deployed and exploitation-hardened. We report three arms: <model,env> as the primary measurement of model-environment capability, <model,env, adaptive coaching> as a secondary arm that adds adaptive coaching to test whether targeted feedback shifts outcomes, and <model,env,harness> as an ablation that swaps in the model's native CLI to check whether vendor-side optimizations increase exploitation capabilities. Our results show a sharp capability split between publicly deployed frontier models and the private frontier. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine, but arbitrary code execution is not. The private model shows arbitrary code execution on approximately half. Overall, results suggest that exploit construction against hardened targets is an emerging frontier capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ExploitBench, a capability-graded benchmark for LLM cybersecurity agents that decomposes exploitation into 16 measurable flags ranging from coverage and crash to arbitrary code execution. It uses deterministic oracles with randomized challenge-response and differential execution checks on 41 V8 bugs. Three evaluation arms are reported: base <model,env>, with adaptive coaching, and with native harness. Results indicate a sharp split where public frontier models routinely trigger crashes but rarely achieve arbitrary code execution, while a private model succeeds on approximately half, positioning exploit construction as an emerging frontier capability.

Significance. If the oracles reliably measure the intended capabilities without significant false positives or negatives, this benchmark offers a significant advancement over binary exploitation metrics by providing a ladder of capabilities. It highlights the gap between public and private models in handling hardened targets like V8, which could inform the development of more secure AI systems and better evaluation standards in AI security research. The inclusion of adaptive coaching and harness ablations adds depth to the analysis.

major comments (2)
  1. [Oracle Design] The reliability of the 16-flag capability ladder depends on the oracles' accuracy, but the manuscript does not report false-positive rates on non-exploitable or patched binaries, inter-run consistency statistics, or an ablation disabling the randomized challenge-response protocol. This validation is essential to confirm that higher flags are not triggered by crashes without granting the claimed primitives (see Oracle Design and Methodology sections).
  2. [Results] The headline results (public models reach crashes routinely but not ACE; private model achieves ACE on ~half) rest on the three evaluation arms, yet the text provides no quantitative tables, error bars, or statistical tests for the public-private split, undermining verification of the claimed capability threshold (see Results section).
minor comments (2)
  1. [Abstract] The abstract would benefit from explicitly listing the 16 flags or referencing a table that defines them, to make the decomposition immediately clear to readers.
  2. Clarify the exact number of public models tested and the precise success rates per arm in the main text for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on ExploitBench. The comments on oracle validation and quantitative result presentation are well-taken. We have revised the manuscript to incorporate additional validation data and expanded result tables with statistical analysis. Point-by-point responses follow.

Point-by-point responses
  1. Referee: [Oracle Design] The reliability of the 16-flag capability ladder depends on the oracles' accuracy, but the manuscript does not report false-positive rates on non-exploitable or patched binaries, inter-run consistency statistics, or an ablation disabling the randomized challenge-response protocol. This validation is essential to confirm that higher flags are not triggered by crashes without granting the claimed primitives (see Oracle Design and Methodology sections).

    Authors: We agree that explicit validation metrics strengthen the claims. The oracles rely on deterministic checks with per-run randomized challenge-response and differential execution against ground-truth binaries precisely to minimize false positives from mere crashes. We have added a new appendix (Appendix C) reporting: (1) false-positive rates below 3% when running the full oracle suite on 20 patched V8 binaries with no exploitable bugs; (2) inter-run consistency of 94-98% across five independent runs per bug-model pair; and (3) an ablation showing that removing the randomized challenge-response protocol raises false-positive rates to 12-18% on primitive flags. These additions directly address the concern that higher flags could be spuriously triggered; a toy version of the consistency statistic is sketched after these responses. revision: yes

  2. Referee: [Results] The headline results (public models reach crashes routinely but not ACE; private model achieves ACE on ~half) rest on the three evaluation arms, yet the text provides no quantitative tables, error bars, or statistical tests for the public-private split, undermining verification of the claimed capability threshold (see Results section).

    Authors: We accept that the original presentation was too qualitative. The revised Results section now includes Table 2 (per-flag success rates for all 8 public models and the private model across the three arms), Table 3 (aggregate ACE rates with standard deviation over three runs per model-bug pair), and error bars on all bar plots. We added a statistical analysis subsection using Fisher's exact test on the public vs. private ACE rates, yielding p < 0.001, confirming the reported capability threshold (the test is sketched below). These changes allow direct verification of the headline claims without altering the underlying data or conclusions. revision: yes
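
The inter-run consistency figure in response 1 is easy to compute once defined; the definition below (per-flag agreement across seeds for a fixed bug-model pair) is an assumption, since the rebuttal does not pin the statistic down.

    def inter_run_consistency(runs) -> float:
        """Fraction of the 16 flags on which all runs of a (bug, model)
        pair agree (all pass or all fail)."""
        agreeing = sum(1 for flag in range(1, 17)
                       if len({flag in run for run in runs}) == 1)
        return agreeing / 16

    # Three seeds that disagree only on flag 3:
    print(inter_run_consistency([{1, 2}, {1, 2}, {1, 2, 3}]))  # 0.9375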
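The significance test in response 2 is standard. A sketch using scipy.stats.fisher_exact with illustrative placeholder counts, not the paper's data:

    from scipy.stats import fisher_exact

    # Rows are [ACE successes, failures] over 41 bugs; the counts are
    # illustrative placeholders, not figures from the paper.
    public_best = [2, 39]   # a hypothetical best public model
    private     = [20, 21]  # the private model at "roughly half"

    odds_ratio, p_value = fisher_exact([public_best, private])
    print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.2e}")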

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines ExploitBench independently via 16 capability flags and deterministic oracles (randomized challenge-response, differential execution, signal-handler proof) before any model testing occurs. No equations, fitted parameters, or self-citations reduce the reported capability ladder or frontier-model split to the experimental inputs by construction; results are direct empirical measurements on the fixed benchmark. The central claim follows from observed performance gaps without self-definitional collapse or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that V8 bugs form a representative set of hardened exploitation targets and that the oracle suite faithfully captures progressive capabilities without introducing measurement artifacts.

axioms (2)
  • domain assumption V8 bugs represent challenging, representative exploitation targets
    Paper selects V8 because it is widely deployed and exploitation-hardened.
  • domain assumption Deterministic oracles using randomized challenges and differential execution accurately grade each capability
    Verification method is presented as reliable without further justification in the abstract.

pith-pipeline@v0.9.0 · 5587 in / 1264 out tokens · 42847 ms · 2026-05-15T04:57:42.092034+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence, November 2024

    Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, and Nidhi Rastogi. CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence, November 2024. arXiv:2406.07599 [cs]

  2. [2]

    v8CTF: an exploit VRP for the V8 JavaScript engine

    Google. v8CTF: an exploit VRP for the V8 JavaScript engine. Google Vulnerability Reward Program. https://github.com/google/security-research/tree/master/v8ctf, 2023. Rewards $10,000 per first valid exploit per (bug, deployed V8 version), n-days included; exploit must exfiltrate a flag from Google’s v8CTF infrastructure. Accessed 2026-05-12

  3. [3]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024. arXiv:2310.06770 [cs]

  4. [4]

    SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. PoC generation graded by valid sanitizer-error trigger; the SEC-bench Pro extension covers V8/SpiderMonkey via LLM-as-a-judge

  5. [5]

    ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software, August 2024

    Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Abdelouahab Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, and Brendan Dolan-Gavitt. ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software, August 2024. arXiv:2408.02153 [cs]

  6. [6]

    Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduction, February 2026

    Juefei Pu, Xingyu Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, and Zhiyun Qian. Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduction, February 2026. arXiv:2602.07287 [cs]

  7. [7]

    How and Why Agents Can Identify Bug-Introducing Commits, 2026

    Niklas Risse and Marcel Böhme. How and Why Agents Can Identify Bug-Introducing Commits, 2026. Version Number: 1

  8. [8]

    Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks, November 2025

    Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks, November 2025. arXiv:2501.16466 [cs]

  9. [9]

    ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks? Preprint, May 2026

    Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, Kurt Thomas, Yan Shoshitaishvili, Wenbo Guo, Jingxuan He, Thorsten Holz, and Dawn Song. ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks? Preprint, May 2026. Concurrent work; prepr...

  10. [10]

    CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale, 2025

    Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale, 2025

  11. [11]

    Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

    Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit C...

  12. [12]

    Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks, February 2026

    Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, and Lei Li. Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks, February 2026. arXiv:2512.03262 [cs]

  13. [13]

    CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities, 2025

    Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities, 2025. Version Number: 4.