pith. machine review for the scientific record.

arxiv: 2605.11086 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 2 theorem links


ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:52 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords AI agents · vulnerability exploitation · cybersecurity benchmark · frontier models · security defenses · real-world vulnerabilities · code execution

The pith

Frontier AI models can turn supplied vulnerability triggers into working exploits for over 100 real cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ExploitGym to measure whether AI agents can extend a provided input that triggers a vulnerability into a complete exploit achieving effects such as code execution or unauthorized access. It supplies 898 instances drawn from real userspace programs, the V8 JavaScript engine, and the Linux kernel, each packaged in a reproducible container with optional security protections. Evaluation of frontier models yields 157 working exploits under one leading configuration and 120 under another, with meaningful performance retained when defenses are active. A sympathetic reader would care because exploitation bridges the gap between a known weakness and actual harm, and the benchmark quantifies how capable current agents are at crossing that gap in controlled settings.

Core claim

ExploitGym is a benchmark of 898 real-world vulnerability instances in containerized environments. Given a triggering input, agents must produce a working exploit that delivers concrete security impact. Frontier models such as Claude Mythos Preview succeed on 157 instances and GPT-5.5 on 120 instances, retaining non-trivial success even after common defenses are enabled.

What carries the argument

ExploitGym benchmark, which supplies containerized instances and requires agents to extend vulnerability triggers into full exploits while varying applied protections.
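The setup described above — a containerized instance, a pre-supplied trigger, and a configurable set of protections — can be sketched as a minimal data model. All field names and values here are illustrative, not the paper's actual schema; only the three domains and the trigger-to-exploit framing come from the paper itself.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one benchmark instance; not the paper's real format.
@dataclass
class ExploitTask:
    cve_id: str        # real-world vulnerability identifier
    domain: str        # "userspace", "v8", or "kernel" (the paper's three domains)
    trigger_path: str  # pre-supplied input that triggers the vulnerability
    image: str         # reproducible container image for this instance
    goal: str = "code_execution"  # required concrete security impact
    mitigations: dict = field(default_factory=dict)  # protections varied per run

# One hypothetical instance (the CVE number appears in the paper's references).
task = ExploitTask(
    cve_id="CVE-2022-4543",
    domain="userspace",
    trigger_path="triggers/poc.bin",
    image="exploitgym/cve-2022-4543:latest",
    mitigations={"aslr": True, "stack_canary": True},
)
```

The agent's job, in these terms, is to transform the input at `trigger_path` into an exploit that achieves `goal` inside `image` under the active `mitigations`.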

If this is right

  • Frontier models achieve concrete security impact on a measurable fraction of tested vulnerabilities.
  • Widely deployed defenses reduce but do not eliminate model success at exploitation.
  • The benchmark supplies a reproducible way to compare and track agent exploitation performance over time.
  • Results indicate that AI agents are approaching practical offensive utility in cybersecurity.
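To make "widely deployed defenses" concrete: the protections this class of benchmark varies — stack canaries, position-independent executables with ASLR, non-executable stacks — are normally toggled with standard compiler, linker, and kernel settings like those below. This is a generic Linux sketch, not the paper's documented configuration.

```python
# Standard Linux hardening toggles (illustrative, not the paper's exact setup).
MITIGATIONS = {
    "stack_canary": "-fstack-protector-strong",  # StackGuard-style canaries
    "pie":          "-fPIE -pie",                # position-independent executable, needed for full ASLR
    "nx_stack":     "-z noexecstack",            # non-executable stack
    "full_relro":   "-Wl,-z,relro,-z,now",       # harden the GOT against overwrites
}

def build_flags(enabled):
    """Compose compiler/linker flags for the mitigations enabled on an instance."""
    unknown = set(enabled) - MITIGATIONS.keys()
    if unknown:
        raise ValueError(f"unknown mitigations: {unknown}")
    return " ".join(MITIGATIONS[m] for m in enabled)

# e.g. gcc $(build_flags(["stack_canary", "nx_stack"])) -o target target.c
# ASLR itself is system-wide: sysctl kernel.randomize_va_space=2
```

Isolating each toggle per instance is what lets the benchmark attribute success-rate drops to individual defenses rather than to their combination.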

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further gains on this task could reduce the human effort required to weaponize newly discovered vulnerabilities.
  • Dual-use findings may accelerate both automated defense tools and automated attack tooling in parallel.
  • Persistent success under defenses suggests the need for new mitigation strategies that target AI reasoning patterns rather than traditional protections alone.

Load-bearing premise

That success when a triggering input is supplied in isolated containers reflects the skills needed for exploitation in uncontrolled real-world settings that demand additional setup and discovery.

What would settle it

An experiment in which the same models are given no pre-supplied trigger and must operate in non-containerized environments yet still fail to produce any working exploits.

Figures

Figures reproduced from arXiv: 2605.11086 by Dawn Song, Elie Bursztein, Eric Wallace, Hongwei Li, Jingxuan He, Kurt Thomas, Luca Invernizzi, Milad Nasr, Nicholas Carlini, Nico Schiller, Srijiith Sesha Narayana, Thorsten Holz, Wenbo Guo, Xiangyu Qi, Yan Shoshitaishvili, Zhun Wang.

Figure 1
Figure 1: Overview of ExploitGym. A vulnerability paired with a mitigation setting defines a task.
Figure 2
Figure 2: Cumulative exploits over wall-clock time (6-hour max.). In contrast, Claude Mythos Preview climbs steeply through the first hour and, crucially, continues to accumulate successes well beyond the two-hour mark without reaching a clear plateau. This non-saturating trajectory underscores Claude Mythos Preview’s ability to sustain long-horizon agentic workflows such as incremental refinement of exploit primiti…
Figure 3
Figure 3: Overlap of successes across Claude Mythos Preview, GPT-5.5, and the union of other models. Different models solve complementary sets of tasks.
Figure 4
Figure 4: Shortened trajectory of an agent exploiting a V8 vulnerability. Starting from a PoV that…
read the original abstract

AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete security impact, such as unauthorized file access or code execution. Exploitation is a particularly challenging task because it requires low-level program reasoning (e.g., about memory layout), runtime adaptation, and sustained progress over long horizons. Meanwhile, it is inherently dual-use, supporting defensive workflows while lowering the barrier for offense. Despite its importance and diagnostic value, exploitation remains under-evaluated. To address this gap, we introduce ExploitGym, a large-scale, diverse, realistic benchmark on the exploitation capabilities of AI agents. Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit. The benchmark comprises 898 instances sourced from real-world vulnerabilities across three domains, including userspace programs, Google's V8 JavaScript engine, and the Linux kernel. We vary the security protections applied to each instance, isolating their impact on agent performance. All configurations are packaged in reproducible containerized environments. Our evaluation shows that while exploitation remains challenging, frontier models can successfully exploit a non-trivial fraction of vulnerabilities. For example, the strongest configurations are Anthropic's latest model Claude Mythos Preview and OpenAI's GPT-5.5, which produce working exploits for 157 and 120 instances, respectively. Notably, even with widely used defenses enabled, models retain non-trivial success rates. These results establish ExploitGym as an effective testbed for exploitation and highlight the growing cybersecurity risks posed by increasingly capable AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ExploitGym, a benchmark of 898 real-world vulnerability instances drawn from userspace programs, Google's V8 engine, and the Linux kernel. Agents receive a vulnerability-triggering input and are tasked with extending it into a working exploit inside reproducible containerized environments; security protections are systematically varied to isolate their effects. Evaluation results indicate that frontier models achieve non-trivial success, with Claude Mythos Preview succeeding on 157 instances and GPT-5.5 on 120 instances, and that success rates remain positive even when common defenses are enabled.

Significance. If the reported success rates are reproducible under the stated protocol, ExploitGym supplies a useful large-scale empirical testbed for measuring AI agents' low-level program reasoning and exploit-construction capabilities. The inclusion of multiple domains, explicit defense variations, and containerized reproducibility are strengths that allow controlled isolation of factors. The work therefore contributes a concrete artifact for tracking progress on a dual-use cybersecurity task.

major comments (2)
  1. Abstract and §3 (Benchmark Construction): the headline claim that models 'produce working exploits for 157 and 120 instances' and thereby demonstrate 'real-world exploitation capabilities' is not fully supported by the described protocol. The setup always supplies the triggering input and executes inside known, reproducible containers; this removes reconnaissance, trigger discovery, environment fingerprinting, and adaptation to unknown ASLR or external dependencies. The reported numbers therefore measure PoC extension under scaffolding rather than autonomous real-world exploitation, and the abstract does not quantify how much the success rates would drop without these prerequisites.
  2. §4 (Evaluation) and abstract: the manuscript reports specific success counts and notes the inclusion of defenses but provides no details on the exact success-verification criteria (e.g., what constitutes a 'working exploit'), the instance-selection process, or the prompting strategies used. These omissions leave moderate gaps in support for the central performance claims and make it difficult for readers to assess whether the 157/120 figures are robust or sensitive to minor protocol changes.
minor comments (2)
  1. §2 (Related Work): the discussion of prior AI-for-security benchmarks could more explicitly contrast ExploitGym's focus on exploit extension with existing fuzzing or vulnerability-discovery suites.
  2. Figure 1 and Table 2: axis labels and legend entries use inconsistent capitalization and abbreviation style; a uniform notation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We agree that the manuscript requires greater precision in describing the benchmark's scope and have revised the abstract and relevant sections to address the concerns. Our point-by-point responses to the major comments are provided below.

read point-by-point responses
  1. Referee: [—] Abstract and §3 (Benchmark Construction): the headline claim that models 'produce working exploits for 157 and 120 instances' and thereby demonstrate 'real-world exploitation capabilities' is not fully supported by the described protocol. The setup always supplies the triggering input and executes inside known, reproducible containers; this removes reconnaissance, trigger discovery, environment fingerprinting, and adaptation to unknown ASLR or external dependencies. The reported numbers therefore measure PoC extension under scaffolding rather than autonomous real-world exploitation, and the abstract does not quantify how much the success rates would drop without these prerequisites.

    Authors: We thank the referee for highlighting this distinction. The manuscript already states that agents are 'Given a program input that triggers a vulnerability' and tasked with extending it into a working exploit; the design intentionally isolates the exploitation phase to focus on low-level program reasoning and long-horizon adaptation, which remain challenging even with the trigger supplied. We do not claim to measure fully autonomous real-world attacks that include reconnaissance or environment discovery. In the revised manuscript we have updated the abstract to clarify that ExploitGym evaluates exploit generation from provided triggers within controlled, reproducible environments rather than claiming broad 'real-world exploitation capabilities'. We have also added an explicit limitations paragraph in §3 discussing the scaffolding provided. We cannot quantify the performance drop that would occur without the supplied trigger, as that would require a different experimental protocol outside the current benchmark design and is noted as future work. revision: partial

  2. Referee: [—] §4 (Evaluation) and abstract: the manuscript reports specific success counts and notes the inclusion of defenses but provides no details on the exact success-verification criteria (e.g., what constitutes a 'working exploit'), the instance-selection process, or the prompting strategies used. These omissions leave moderate gaps in support for the central performance claims and make it difficult for readers to assess whether the 157/120 figures are robust or sensitive to minor protocol changes.

    Authors: We agree that these details are essential for assessing robustness and reproducibility. In the revised manuscript we have expanded §4 with a new subsection on evaluation protocol. It now specifies the success-verification criteria (automated checks confirming the intended security impact, such as arbitrary code execution or unauthorized file access, via the container test harness), the instance-selection process (sourcing from public CVE databases, filtering for reproducible triggers and containerizable environments, and stratified sampling across the three domains), and the prompting strategies (including the base system prompt, chain-of-thought instructions, and handling of long-horizon interactions). These additions directly support the reported success counts and allow readers to evaluate sensitivity to protocol variations. revision: yes
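The verification protocol the rebuttal describes — run the candidate exploit inside the instance container, then check for the intended observable impact — might look like the following minimal sketch. The container name, exploit command, and secret token are hypothetical; the paper's actual harness is not published in this excerpt.

```python
import subprocess

def impact_observed(stdout: str, secret: str) -> bool:
    """Success predicate: the protected secret leaked into the exploit's output,
    evidencing unauthorized file access. A code-execution goal could instead
    check for a marker file the payload drops."""
    return secret in stdout

def verify_exploit(container: str, exploit_cmd: str, secret: str) -> bool:
    # Run the candidate exploit inside the instance's container (requires Docker).
    proc = subprocess.run(
        ["docker", "exec", container, "sh", "-c", exploit_cmd],
        capture_output=True, text=True, timeout=600,
    )
    return impact_observed(proc.stdout, secret)
```

The key design point is that success is an automated, externally observable effect rather than a judgment about the exploit's code, which is what makes the 157/120 counts mechanically checkable.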

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

full rationale

The paper introduces ExploitGym as a benchmark and reports empirical success rates from running frontier models on 898 containerized vulnerability instances. There are no equations, derivations, parameter fits, or predictions that reduce to inputs by construction. All results are generated by direct execution in external reproducible environments, with no self-citation load-bearing on uniqueness theorems or ansatzes. The evaluation protocol is self-contained against the provided benchmark data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the construction of the new benchmark and the empirical model evaluations performed within it. No free parameters, additional axioms, or invented physical entities are required beyond standard practices for creating reproducible security benchmarks.

invented entities (1)
  • ExploitGym benchmark · no independent evidence
    purpose: To evaluate AI agents' ability to turn vulnerability triggers into working exploits
    The benchmark itself is the primary new contribution introduced by the paper.

pith-pipeline@v0.9.0 · 5655 in / 1333 out tokens · 123173 ms · 2026-05-13T02:52:48.710819+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Agentharm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. In Proceedings of the Thirteenth International Conference on Learning Representat...

  2. [2]

    Introducing claude opus 4.7

    Anthropic. Introducing claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7. Accessed 2026-05

  3. [3]

    Making frontier cybersecurity capabilities available to defenders

    Anthropic. Making frontier cybersecurity capabilities available to defenders. https://www.anthropic.com/news/claude-code-security. Accessed 2026-05

  4. [4]

    Real-time cyber safeguards on claude, 2025

    Anthropic. Real-time cyber safeguards on claude, 2025. Accessed: 2025-05-06

  5. [5]

    AEG: automatic exploit generation

    Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley. AEG: automatic exploit generation. In Network and Distributed System Security, 2011

  6. [6]

    Empirical security analysis of software-based fault isolation through controlled fault injection

    Nils Bars, Lukas Bernhard, Moritz Schloegel, and Thorsten Holz. Empirical security analysis of software-based fault isolation through controlled fault injection. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS ’25, pages 2639–2652, New York, NY, USA, 2025. Association for Computing Machinery

  7. [7]

    Unleashing mayhem on binary code

    Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. Unleashing mayhem on binary code. In IEEE Symposium on Security and Privacy, 2012

  8. [8]

    Js-fuzzer – javascript fuzzer for stand-alone shells like d8, chakra, jsc or spidermonkey

    Oliver Chang. Js-fuzzer – javascript fuzzer for stand-alone shells like d8, chakra, jsc or spidermonkey. https://chromium.googlesource.com/v8/v8/+/master/tools/clusterfuzz/js_fuzzer/README.md. Accessed 2026-05

  9. [9]

    TypePulse: Detecting Type Confusion Bugs in Rust Programs

    Hung-Mao Chen, Xu He, Shu Wang, Xiaokuan Zhang, and Kun Sun. TypePulse: Detecting Type Confusion Bugs in Rust Programs. In USENIX Security Symposium, 2025

  10. [10]

    SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

    Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, et al. Secureagentbench: Benchmarking secure code generation under realistic vulnerability scenarios.arXiv preprint arXiv:2509.22097, 2025

  11. [11]

    StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks

    Crispin Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proceedings of the 7th USENIX Security Symposium, pages 63–78, 1998

  12. [12]

    Security enhancements in Red Hat Enterprise Linux (position-independent executables)

    Ulrich Drepper. Security enhancements in Red Hat Enterprise Linux (position-independent executables). Technical report, Red Hat, Inc., 2003

  13. [13]

    Ffmpeg: A complete, cross-platform solution to record, convert and stream audio and video. https://www.ffmpeg.org/

    FFmpeg. Ffmpeg: A complete, cross-platform solution to record, convert and stream audio and video. https://www.ffmpeg.org/. Accessed: 2025-05-10

  14. [14]

    AFL++: Combining incremental steps of fuzzing research

    Andrea Fioraldi, Dominik Christian Maier, Heiko Eißfeldt, and Marc Heuse. AFL++: Combining incremental steps of fuzzing research. In USENIX Workshop on Offensive Technologies, 2020

  15. [15]

    A deep dive into V8 sandbox escape technique used in in-the-wild exploit

    Frontier Squad. A deep dive into V8 sandbox escape technique used in in-the-wild exploit. Theori Blog, January 2024. Accessed: 2026-05-09

  16. [16]

    nsjail: A light-weight process isolation tool

    Google. nsjail: A light-weight process isolation tool. https://github.com/google/nsjail. Accessed: 2026-04-21

  17. [17]

    syzbot: Continuous kernel fuzzing dashboard

    Google. syzbot: Continuous kernel fuzzing dashboard. https://syzkaller.appspot.com/upstream. Accessed: 2026-04-21

  18. [18]

    ClusterFuzz: Scalable fuzzing infrastructure

    Google. ClusterFuzz: Scalable fuzzing infrastructure. https://github.com/google/clusterfuzz, 2019. Accessed: 2026-04-21

  19. [19]

    OSS-Fuzz: Continuous fuzzing for open source software

    Google. OSS-Fuzz: Continuous fuzzing for open source software. https://github.com/google/oss-fuzz, 2026

  20. [20]

    OSV: Open Source Vulnerabilities

    Google. OSV: Open Source Vulnerabilities. https://osv.dev/, 2026. Comprehensive vulnerability database for open source projects and dependencies. Accessed: 2026-05-03

  21. [21]

    kernelctf: Kernel capture-the-flag (rules and infrastructure)

    Google Security Research. kernelctf: Kernel capture-the-flag (rules and infrastructure). https://github.com/google/security-research/tree/master/kernelctf. Accessed 2026-04

  22. [22]

    The V8 heap sandbox. https://v8.dev/blog/sandbox, 2024

    Samuel Groß. The V8 heap sandbox. https://v8.dev/blog/sandbox, 2024

  23. [23]

    Fuzzilli: Fuzzing for JavaScript JIT Compiler Vulnerabilities

    Samuel Groß, Simon Koch, Lukas Bernhard, Thorsten Holz, and Martin Johns. Fuzzilli: Fuzzing for JavaScript JIT Compiler Vulnerabilities. In NDSS, 2023

  24. [24]

    TypeSan: Practical Type Confusion Detection

    Istvan Haller, Yuseok Jeon, Hui Peng, Mathias Payer, Cristiano Giuffrida, Herbert Bos, and Erik Van Der Kouwe. TypeSan: Practical Type Confusion Detection. In ACM SIGSAC Conference on Computer and Communications Security, 2016

  25. [25]

    Automatic heap layout manipulation for exploitation

    Sean Heelan, Tom Melham, and Daniel Kroening. Automatic heap layout manipulation for exploitation. In USENIX Security, 2018

  26. [26]

    Gollum: Modular and greybox exploit generation for heap overflows in interpreters

    Sean Heelan, Tom Melham, and Daniel Kroening. Gollum: Modular and greybox exploit generation for heap overflows in interpreters. In ACM SIGSAC Conference on Computer and Communications Security, 2019

  27. [27]

    Data-oriented programming: On the expressiveness of non-control data attacks

    Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang. Data-oriented programming: On the expressiveness of non-control data attacks. In IEEE Symposium on Security and Privacy, 2016

  28. [28]

    SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024

  29. [29]

    SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  30. [30]

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B Breuer, Andy Z...

  31. [31]

    user_namespaces(7) — overview of Linux user namespaces

    Linux man-pages project. user_namespaces(7) — overview of Linux user namespaces. https://man7.org/linux/man-pages/man7/user_namespaces.7.html, 2024

  32. [32]

    libfuzzer – a library for coverage-guided fuzz testing

    LLVM. libfuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html. Accessed 2026-05

  33. [33]

    The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Transactions on Software Engineering, 47(11), 2019

    Valentin JM Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Transactions on Software Engineering, 47(11), 2019

  34. [34]

    CVE-2022-4543 Detail

    National Institute of Standards and Technology. CVE-2022-4543 Detail. https://nvd.nist.gov/vuln/detail/CVE-2022-4543, 2023. National Vulnerability Database. Published January 11, 2023; last modified April 8, 2025. Accessed May 9, 2026.

  35. [35]

    SECODEPLT: A unified benchmark for evaluating the security risks and capabilities of code genAI

    Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, and Dawn Song. SECODEPLT: A unified benchmark for evaluating the security risks and capabilities of code genAI. InNeurIPS Datasets and Benchmarks Track, 2025

  36. [36]

    Smashing the stack for fun and profit. Phrack Magazine, 1996

    Aleph One. Smashing the stack for fun and profit. Phrack Magazine, 1996

  37. [37]

    Frontier risk and preparedness

    OpenAI. Frontier risk and preparedness. https://openai.com/index/frontier-risk-a nd-preparedness. Accessed 2026-05

  38. [38]

    Introducing gpt-5.5

    OpenAI. Introducing gpt-5.5. https://openai.com/index/introducing-gpt-5-5. Accessed 2026-05

  39. [39]

    Trusted access for cyber, 2025

    OpenAI. Trusted access for cyber, 2025. Accessed: 2025-05-06

  40. [40]

    Openssl: Tls/ssl and crypto library

    OpenSSL. Openssl: Tls/ssl and crypto library. https://github.com/openssl/openssl. Accessed: 2025-09-15

  41. [41]

    Fuzzing JavaScript Engines with Aspect-Preserving Mutation

    Soyeon Park, Wen Xu, Insu Yun, Daehee Jang, and Taesoo Kim. Fuzzing JavaScript Engines with Aspect-Preserving Mutation. In IEEE Symposium on Security and Privacy (SP), 2020

  42. [42]

    PaX address space layout randomization (ASLR)

    PaX Team. PaX address space layout randomization (ASLR). https://pax.grsecurity.net/docs/aslr.txt, 2001

  43. [43]

    CVE-2016-8655: Linux AF_PACKET race condition local root exploit

    Philip Pettersson. CVE-2016-8655: Linux AF_PACKET race condition local root exploit. oss-security mailing list, December 2016. Exploit source: https://github.com/bcoles/ kernel-exploits/blob/master/CVE-2016-8655/chocobo_root.c

  44. [44]

    Evaluating frontier models for dangerous capabilities

    Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca ...

  45. [45]

    Return-oriented programming: Systems, languages, and applications

    Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. Return-oriented programming: Systems, languages, and applications. ACM Trans. Inf. Syst. Secur., 15(1):2:1–2:34, 2012

  46. [46]

    V8 sandbox escape via regexp bytecode modification

    rycbar77. V8 sandbox escape via regexp bytecode modification. GitHub, 2024. Technique used in V8CTF M122, M123, and PlaidCTF

  47. [47]

    kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels

    Sergej Schumilo, Cornelius Aschermann, Robert Gawlik, Sebastian Schinzel, and Thorsten Holz. kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels. In USENIX Security Symposium, 2017

  48. [48]

    The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86)

    Hovav Shacham. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In ACM Conference on Computer and Communications Security, 2007

  49. [49]

    Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems, 37:57472–57498, 2024

    Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, et al. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems, 37:57472–57498, 2024

  50. [50]

    Secrepobench: Benchmarking code agents for secure code completion in real-world repositories

    Chihao Shen, Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, and Yizheng Chen. Secrepobench: Benchmarking code agents for secure code completion in real-world repositories. arXiv preprint arXiv:2504.21205, 2025

  51. [51]

    Sok: (state of) the art of war: Offensive techniques in binary analysis

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, et al. Sok: (state of) the art of war: Offensive techniques in binary analysis. In 2016 IEEE symposium on security and privacy (SP), pages 138–157. IEEE, 2016

  52. [52]

    Sok: Eternal war in memory

    Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. Sok: Eternal war in memory. In IEEE Symposium on Security and Privacy, 2013

  53. [53]

    SoK: Eternal War in Memory

    Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. SoK: Eternal War in Memory. In IEEE Symposium on Security and Privacy, 2013

  54. [54]

    Chromium issue tracker

    The Chromium Project. Chromium issue tracker. https://issues.chromium.org/. Accessed: 2026-04-21

  55. [55]

    The Linux Kernel Organization

    The Linux Kernel Documentation. The kernel’s command-line parameters. The Linux Kernel Organization. Documentation for Linux kernel boot parameters, including kaslr and nokaslr

  56. [56]

    V8 JavaScript Engine

    V8 Project Authors. V8 JavaScript Engine. https://v8.dev/, 2026. Google’s open-source JavaScript and WebAssembly engine

  57. [57]

    How ai can strengthen digital security

    Phil Venables and Royal Hansen. How ai can strengthen digital security. https://blog.google/innovation-and-ai/technology/safety-security/google-ai-cyber-defense-initiative. Accessed 2026-05

  58. [58]

    Baxbench: Can llms generate correct and secure backends? In ICML, 2025

    Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanovic, Jingxuan He, and Martin Vechev. Baxbench: Can llms generate correct and secure backends? In ICML, 2025

  59. [59]

    syzkaller: An unsupervised coverage-guided kernel fuzzer. https://github.com/google/syzkaller

    Dmitry Vyukov and syzkaller contributors. syzkaller: An unsupervised coverage-guided kernel fuzzer. https://github.com/google/syzkaller. Accessed: 2026-04-21

  60. [60]

    SyzVegas: Beating Kernel Fuzzing Odds with Reinforcement Learning

    Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V Krishnamurthy, and Nael Abu-Ghazaleh. SyzVegas: Beating Kernel Fuzzing Odds with Reinforcement Learning. In USENIX Security Symposium, 2021

  61. [61]

    MAZE: Towards automated heap feng shui

    Yan Wang, Chao Zhang, Zixuan Zhao, Bolun Zhang, Xiaorui Gong, and Wei Zou. MAZE: Towards automated heap feng shui. In USENIX Security, 2021

  62. [62]

    Cybergym: Evaluating AI agents’ real-world cybersecurity capabilities at scale

    Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. Cybergym: Evaluating AI agents’ real-world cybersecurity capabilities at scale. In The Fourteenth International Conference on Learning Representations, 2026

  63. [63]

    PATCHAGENT: A practical program repair agent mimicking human expertise

    Zheng Yu, Ziyi Guo, Yuhang Wu, Jiahao Yu, Meng Xu, Dongliang Mu, Yan Chen, and Xinyu Xing. PATCHAGENT: A practical program repair agent mimicking human expertise. In USENIX Security, 2025

  64. [64]

    Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y

    Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit C...

  65. [65]

    Cybench: A framework for evaluating cybersecurity capabilities and risks of language models

    Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Julian Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Haoxiang Yang, Aolin Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Kenny O Oseleononmen, Dan Boneh, Da...

  66. [66]

    Fuzzing: A Survey for Roadmap

    Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. Fuzzing: A Survey for Roadmap. ACM Computing Surveys (CSUR), 54(11s), 2022

  67. [67]

    CVE-bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities

    Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities. In 42nd International Conference on Machine Le...
