pith. machine review for the scientific record.

arxiv: 2605.11086 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 2 theorem links


ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:52 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords AI agents · vulnerability exploitation · cybersecurity benchmark · frontier models · security defenses · real-world vulnerabilities · code execution

The pith

Frontier AI models can turn supplied vulnerability triggers into working exploits for over 100 real cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ExploitGym to measure whether AI agents can extend a provided input that triggers a vulnerability into a complete exploit achieving effects such as code execution or unauthorized access. It supplies 898 instances drawn from real userspace programs, the V8 JavaScript engine, and the Linux kernel, each packaged in a reproducible container with optional security protections. Evaluation of frontier models yields 157 working exploits under one leading configuration and 120 under another, with meaningful performance retained when defenses are active. A sympathetic reader would care because exploitation bridges the gap between a known weakness and actual harm, and the benchmark quantifies how capable current agents are at crossing that gap in controlled settings.

Core claim

ExploitGym is a benchmark of 898 real-world vulnerability instances in containerized environments. Given a triggering input, agents must produce a working exploit that delivers concrete security impact. Frontier models such as Claude Mythos Preview succeed on 157 instances and GPT-5.5 on 120 instances, retaining non-trivial success even after common defenses are enabled.

What carries the argument

ExploitGym benchmark, which supplies containerized instances and requires agents to extend vulnerability triggers into full exploits while varying applied protections.
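The setup described above — a containerized instance, a pre-supplied trigger, and a configurable set of protections — can be sketched as a minimal data model. All field names and values here are illustrative, not the paper's actual schema; only the three domains and the trigger-to-exploit framing come from the paper itself.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one benchmark instance; not the paper's real format.
@dataclass
class ExploitTask:
    cve_id: str        # real-world vulnerability identifier
    domain: str        # "userspace", "v8", or "kernel" (the paper's three domains)
    trigger_path: str  # pre-supplied input that triggers the vulnerability
    image: str         # reproducible container image for this instance
    goal: str = "code_execution"  # required concrete security impact
    mitigations: dict = field(default_factory=dict)  # protections varied per run

# One hypothetical instance (the CVE number appears in the paper's references).
task = ExploitTask(
    cve_id="CVE-2022-4543",
    domain="userspace",
    trigger_path="triggers/poc.bin",
    image="exploitgym/cve-2022-4543:latest",
    mitigations={"aslr": True, "stack_canary": True},
)
```

The agent's job, in these terms, is to transform the input at `trigger_path` into an exploit that achieves `goal` inside `image` under the active `mitigations`.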

If this is right

  • Frontier models achieve concrete security impact on a measurable fraction of tested vulnerabilities.
  • Widely deployed defenses reduce but do not eliminate model success at exploitation.
  • The benchmark supplies a reproducible way to compare and track agent exploitation performance over time.
  • Results indicate that AI agents are approaching practical offensive utility in cybersecurity.
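To make "widely deployed defenses" concrete: the protections this class of benchmark varies — stack canaries, position-independent executables with ASLR, non-executable stacks — are normally toggled with standard compiler, linker, and kernel settings like those below. This is a generic Linux sketch, not the paper's documented configuration.

```python
# Standard Linux hardening toggles (illustrative, not the paper's exact setup).
MITIGATIONS = {
    "stack_canary": "-fstack-protector-strong",  # StackGuard-style canaries
    "pie":          "-fPIE -pie",                # position-independent executable, needed for full ASLR
    "nx_stack":     "-z noexecstack",            # non-executable stack
    "full_relro":   "-Wl,-z,relro,-z,now",       # harden the GOT against overwrites
}

def build_flags(enabled):
    """Compose compiler/linker flags for the mitigations enabled on an instance."""
    unknown = set(enabled) - MITIGATIONS.keys()
    if unknown:
        raise ValueError(f"unknown mitigations: {unknown}")
    return " ".join(MITIGATIONS[m] for m in enabled)

# e.g. gcc $(build_flags(["stack_canary", "nx_stack"])) -o target target.c
# ASLR itself is system-wide: sysctl kernel.randomize_va_space=2
```

Isolating each toggle per instance is what lets the benchmark attribute success-rate drops to individual defenses rather than to their combination.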

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further gains on this task could reduce the human effort required to weaponize newly discovered vulnerabilities.
  • Dual-use findings may accelerate both automated defense tools and automated attack tooling in parallel.
  • Persistent success under defenses suggests the need for new mitigation strategies that target AI reasoning patterns rather than traditional protections alone.

Load-bearing premise

That success when a triggering input is supplied in isolated containers reflects the skills needed for exploitation in uncontrolled real-world settings that demand additional setup and discovery.

What would settle it

An experiment in which the same models are given no pre-supplied trigger and must operate in non-containerized environments yet still fail to produce any working exploits.

Figures

Figures reproduced from arXiv: 2605.11086 by Dawn Song, Elie Bursztein, Eric Wallace, Hongwei Li, Jingxuan He, Kurt Thomas, Luca Invernizzi, Milad Nasr, Nicholas Carlini, Nico Schiller, Srijiith Sesha Narayana, Thorsten Holz, Wenbo Guo, Xiangyu Qi, Yan Shoshitaishvili, Zhun Wang.

Figure 1
Figure 1: Overview of ExploitGym. A vulnerability paired with a mitigation setting defines a task.
Figure 2
Figure 2: Cumulative exploits over wall-clock time (6-hour max.). In contrast, Claude Mythos Preview climbs steeply through the first hour and, crucially, continues to accumulate successes well beyond the two-hour mark without reaching a clear plateau. This non-saturating trajectory underscores Claude Mythos Preview’s ability to sustain long-horizon agentic workflows such as incremental refinement of exploit primiti…
Figure 3
Figure 3: Overlap of successes across Claude Mythos Preview, GPT-5.5, and the union of other models. Different models solve complementary sets of tasks.
Figure 4
Figure 4: Shortened trajectory of an agent exploiting a V8 vulnerability. Starting from a PoV that…
read the original abstract

AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete security impact, such as unauthorized file access or code execution. Exploitation is a particularly challenging task because it requires low-level program reasoning (e.g., about memory layout), runtime adaptation, and sustained progress over long horizons. Meanwhile, it is inherently dual-use, supporting defensive workflows while lowering the barrier for offense. Despite its importance and diagnostic value, exploitation remains under-evaluated. To address this gap, we introduce ExploitGym, a large-scale, diverse, realistic benchmark on the exploitation capabilities of AI agents. Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit. The benchmark comprises 898 instances sourced from real-world vulnerabilities across three domains, including userspace programs, Google's V8 JavaScript engine, and the Linux kernel. We vary the security protections applied to each instance, isolating their impact on agent performance. All configurations are packaged in reproducible containerized environments. Our evaluation shows that while exploitation remains challenging, frontier models can successfully exploit a non-trivial fraction of vulnerabilities. For example, the strongest configurations are Anthropic's latest model Claude Mythos Preview and OpenAI's GPT-5.5, which produce working exploits for 157 and 120 instances, respectively. Notably, even with widely used defenses enabled, models retain non-trivial success rates. These results establish ExploitGym as an effective testbed for exploitation and highlight the growing cybersecurity risks posed by increasingly capable AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ExploitGym, a benchmark of 898 real-world vulnerability instances drawn from userspace programs, Google's V8 engine, and the Linux kernel. Agents receive a vulnerability-triggering input and are tasked with extending it into a working exploit inside reproducible containerized environments; security protections are systematically varied to isolate their effects. Evaluation results indicate that frontier models achieve non-trivial success, with Claude Mythos Preview succeeding on 157 instances and GPT-5.5 on 120 instances, and that success rates remain positive even when common defenses are enabled.

Significance. If the reported success rates are reproducible under the stated protocol, ExploitGym supplies a useful large-scale empirical testbed for measuring AI agents' low-level program reasoning and exploit-construction capabilities. The inclusion of multiple domains, explicit defense variations, and containerized reproducibility are strengths that allow controlled isolation of factors. The work therefore contributes a concrete artifact for tracking progress on a dual-use cybersecurity task.

major comments (2)
  1. Abstract and §3 (Benchmark Construction): the headline claim that models 'produce working exploits for 157 and 120 instances' and thereby demonstrate 'real-world exploitation capabilities' is not fully supported by the described protocol. The setup always supplies the triggering input and executes inside known, reproducible containers; this removes reconnaissance, trigger discovery, environment fingerprinting, and adaptation to unknown ASLR or external dependencies. The reported numbers therefore measure PoC extension under scaffolding rather than autonomous real-world exploitation, and the abstract does not quantify how much the success rates would drop without these prerequisites.
  2. §4 (Evaluation) and abstract: the manuscript reports specific success counts and notes the inclusion of defenses but provides no details on the exact success-verification criteria (e.g., what constitutes a 'working exploit'), the instance-selection process, or the prompting strategies used. These omissions leave moderate gaps in support for the central performance claims and make it difficult for readers to assess whether the 157/120 figures are robust or sensitive to minor protocol changes.
minor comments (2)
  1. §2 (Related Work): the discussion of prior AI-for-security benchmarks could more explicitly contrast ExploitGym's focus on exploit extension with existing fuzzing or vulnerability-discovery suites.
  2. Figure 1 and Table 2: axis labels and legend entries use inconsistent capitalization and abbreviation style; a uniform notation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We agree that the manuscript requires greater precision in describing the benchmark's scope and have revised the abstract and relevant sections to address the concerns. Our point-by-point responses to the major comments are provided below.

read point-by-point responses
  1. Referee: [—] Abstract and §3 (Benchmark Construction): the headline claim that models 'produce working exploits for 157 and 120 instances' and thereby demonstrate 'real-world exploitation capabilities' is not fully supported by the described protocol. The setup always supplies the triggering input and executes inside known, reproducible containers; this removes reconnaissance, trigger discovery, environment fingerprinting, and adaptation to unknown ASLR or external dependencies. The reported numbers therefore measure PoC extension under scaffolding rather than autonomous real-world exploitation, and the abstract does not quantify how much the success rates would drop without these prerequisites.

    Authors: We thank the referee for highlighting this distinction. The manuscript already states that agents are 'Given a program input that triggers a vulnerability' and tasked with extending it into a working exploit; the design intentionally isolates the exploitation phase to focus on low-level program reasoning and long-horizon adaptation, which remain challenging even with the trigger supplied. We do not claim to measure fully autonomous real-world attacks that include reconnaissance or environment discovery. In the revised manuscript we have updated the abstract to clarify that ExploitGym evaluates exploit generation from provided triggers within controlled, reproducible environments rather than claiming broad 'real-world exploitation capabilities'. We have also added an explicit limitations paragraph in §3 discussing the scaffolding provided. We cannot quantify the performance drop that would occur without the supplied trigger, as that would require a different experimental protocol outside the current benchmark design and is noted as future work. revision: partial

  2. Referee: [—] §4 (Evaluation) and abstract: the manuscript reports specific success counts and notes the inclusion of defenses but provides no details on the exact success-verification criteria (e.g., what constitutes a 'working exploit'), the instance-selection process, or the prompting strategies used. These omissions leave moderate gaps in support for the central performance claims and make it difficult for readers to assess whether the 157/120 figures are robust or sensitive to minor protocol changes.

    Authors: We agree that these details are essential for assessing robustness and reproducibility. In the revised manuscript we have expanded §4 with a new subsection on evaluation protocol. It now specifies the success-verification criteria (automated checks confirming the intended security impact, such as arbitrary code execution or unauthorized file access, via the container test harness), the instance-selection process (sourcing from public CVE databases, filtering for reproducible triggers and containerizable environments, and stratified sampling across the three domains), and the prompting strategies (including the base system prompt, chain-of-thought instructions, and handling of long-horizon interactions). These additions directly support the reported success counts and allow readers to evaluate sensitivity to protocol variations. revision: yes
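The verification protocol the rebuttal describes — run the candidate exploit inside the instance container, then check for the intended observable impact — might look like the following minimal sketch. The container name, exploit command, and secret token are hypothetical; the paper's actual harness is not published in this excerpt.

```python
import subprocess

def impact_observed(stdout: str, secret: str) -> bool:
    """Success predicate: the protected secret leaked into the exploit's output,
    evidencing unauthorized file access. A code-execution goal could instead
    check for a marker file the payload drops."""
    return secret in stdout

def verify_exploit(container: str, exploit_cmd: str, secret: str) -> bool:
    # Run the candidate exploit inside the instance's container (requires Docker).
    proc = subprocess.run(
        ["docker", "exec", container, "sh", "-c", exploit_cmd],
        capture_output=True, text=True, timeout=600,
    )
    return impact_observed(proc.stdout, secret)
```

The key design point is that success is an automated, externally observable effect rather than a judgment about the exploit's code, which is what makes the 157/120 counts mechanically checkable.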

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

full rationale

The paper introduces ExploitGym as a benchmark and reports empirical success rates from running frontier models on 898 containerized vulnerability instances. There are no equations, derivations, parameter fits, or predictions that reduce to inputs by construction. All results are generated by direct execution in external reproducible environments, with no self-citation load-bearing on uniqueness theorems or ansatzes. The evaluation protocol is self-contained against the provided benchmark data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the construction of the new benchmark and the empirical model evaluations performed within it. No free parameters, additional axioms, or invented physical entities are required beyond standard practices for creating reproducible security benchmarks.

invented entities (1)
  • ExploitGym benchmark · no independent evidence
    purpose: To evaluate AI agents' ability to turn vulnerability triggers into working exploits
    The benchmark itself is the primary new contribution introduced by the paper.

pith-pipeline@v0.9.0 · 5655 in / 1333 out tokens · 123173 ms · 2026-05-13T02:52:48.710819+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Agentharm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. In Proceedings of the Thirteenth International Conference on Learning Representat...

  2. [2]

    Introducing claude opus 4.7

    Anthropic. Introducing claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7. Accessed 2026-05

  3. [3]

    Making frontier cybersecurity capabilities available to defenders

    Anthropic. Making frontier cybersecurity capabilities available to defenders. https://www.anthropic.com/news/claude-code-security. Accessed 2026-05

  4. [4]

    Real-time cyber safeguards on claude, 2025

    Anthropic. Real-time cyber safeguards on claude, 2025. Accessed: 2025-05-06

  5. [5]

    AEG: automatic exploit generation

    Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley. AEG: automatic exploit generation. In Network and Distributed System Security, 2011

  6. [6]

    Empirical security analysis of software-based fault isolation through controlled fault injection

    Nils Bars, Lukas Bernhard, Moritz Schloegel, and Thorsten Holz. Empirical security analysis of software-based fault isolation through controlled fault injection. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS ’25, pages 2639–2652, New York, NY, USA, 2025. Association for Computing Machinery

  7. [7]

    Unleashing mayhem on binary code

    Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. Unleashing mayhem on binary code. In IEEE Symposium on Security and Privacy, 2012

  8. [8]

    Js-fuzzer – javascript fuzzer for stand-alone shells like d8, chakra, jsc or spidermonkey

    Oliver Chang. Js-fuzzer – javascript fuzzer for stand-alone shells like d8, chakra, jsc or spidermonkey. https://chromium.googlesource.com/v8/v8/+/master/tools/clusterfuzz/js_fuzzer/README.md. Accessed 2026-05

  9. [9]

    TypePulse: Detecting Type Confusion Bugs in Rust Programs

    Hung-Mao Chen, Xu He, Shu Wang, Xiaokuan Zhang, and Kun Sun. TypePulse: Detecting Type Confusion Bugs in Rust Programs. In USENIX Security Symposium, 2025

  10. [10]

    SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

    Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, et al. Secureagentbench: Benchmarking secure code generation under realistic vulnerability scenarios.arXiv preprint arXiv:2509.22097, 2025

  11. [11]

    StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks

    Crispin Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proceedings of the 7th USENIX Security Symposium, pages 63–78, 1998

  12. [12]

    Security enhancements in Red Hat Enterprise Linux (position-independent executables)

    Ulrich Drepper. Security enhancements in Red Hat Enterprise Linux (position-independent executables). Technical report, Red Hat, Inc., 2003

  13. [13]

    Ffmpeg: A complete, cross-platform solution to record, convert and stream audio and video. https://www.ffmpeg.org/

    FFmpeg. Ffmpeg: A complete, cross-platform solution to record, convert and stream audio and video. https://www.ffmpeg.org/. Accessed: 2025-05-10

  14. [14]

    AFL++: Combining incremental steps of fuzzing research

    Andrea Fioraldi, Dominik Christian Maier, Heiko Eißfeldt, and Marc Heuse. AFL++: Combining incremental steps of fuzzing research. In USENIX Workshop on Offensive Technologies, 2020

  15. [15]

    A deep dive into V8 sandbox escape technique used in in-the-wild exploit

    Frontier Squad. A deep dive into V8 sandbox escape technique used in in-the-wild exploit. Theori Blog, January 2024. Accessed: 2026-05-09

  16. [16]

    nsjail: A light-weight process isolation tool

    Google. nsjail: A light-weight process isolation tool. https://github.com/google/nsjail. Accessed: 2026-04-21

  17. [17]

    syzbot: Continuous kernel fuzzing dashboard

    Google. syzbot: Continuous kernel fuzzing dashboard. https://syzkaller.appspot.com/upstream. Accessed: 2026-04-21

  18. [18]

    ClusterFuzz: Scalable fuzzing infrastructure

    Google. ClusterFuzz: Scalable fuzzing infrastructure. https://github.com/google/clusterfuzz, 2019. Accessed: 2026-04-21

  19. [19]

    OSS-Fuzz: Continuous fuzzing for open source software

    Google. OSS-Fuzz: Continuous fuzzing for open source software. https://github.com/google/oss-fuzz, 2026

  20. [20]

    OSV: Open Source Vulnerabilities

    Google. OSV: Open Source Vulnerabilities. https://osv.dev/, 2026. Comprehensive vulnerability database for open source projects and dependencies. Accessed: 2026-05-03

  21. [21]

    kernelctf: Kernel capture-the-flag (rules and infrastructure)

    Google Security Research. kernelctf: Kernel capture-the-flag (rules and infrastructure). https://github.com/google/security-research/tree/master/kernelctf. Accessed 2026-04

  22. [22]

    The V8 heap sandbox. https://v8.dev/blog/sandbox, 2024

    Samuel Groß. The V8 heap sandbox. https://v8.dev/blog/sandbox, 2024

  23. [23]

    Fuzzilli: Fuzzing for JavaScript JIT Compiler Vulnerabilities

    Samuel Groß, Simon Koch, Lukas Bernhard, Thorsten Holz, and Martin Johns. Fuzzilli: Fuzzing for JavaScript JIT Compiler Vulnerabilities. In NDSS, 2023

  24. [24]

    TypeSan: Practical Type Confusion Detection

    Istvan Haller, Yuseok Jeon, Hui Peng, Mathias Payer, Cristiano Giuffrida, Herbert Bos, and Erik Van Der Kouwe. TypeSan: Practical Type Confusion Detection. In ACM SIGSAC Conference on Computer and Communications Security, 2016

  25. [25]

    Automatic heap layout manipulation for exploitation

    Sean Heelan, Tom Melham, and Daniel Kroening. Automatic heap layout manipulation for exploitation. In USENIX Security, 2018

  26. [26]

    Gollum: Modular and greybox exploit generation for heap overflows in interpreters

    Sean Heelan, Tom Melham, and Daniel Kroening. Gollum: Modular and greybox exploit generation for heap overflows in interpreters. In ACM SIGSAC Conference on Computer and Communications Security, 2019

  27. [27]

    Data-oriented programming: On the expressiveness of non-control data attacks

    Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang. Data-oriented programming: On the expressiveness of non-control data attacks. In IEEE Symposium on Security and Privacy, 2016

  28. [28]

    SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024

  29. [29]

    SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  30. [30]

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B Breuer, Andy Z...

  31. [31]

    user_namespaces(7) — overview of Linux user namespaces

    Linux man-pages project. user_namespaces(7) — overview of Linux user namespaces. https://man7.org/linux/man-pages/man7/user_namespaces.7.html, 2024

  32. [32]

    libfuzzer – a library for coverage-guided fuzz testing

    LLVM. libfuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html. Accessed 2026-05

  33. [33]

    The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Transactions on Software Engineering, 47(11), 2019

    Valentin JM Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Transactions on Software Engineering, 47(11), 2019

  34. [34]

    CVE-2022-4543 Detail

    National Institute of Standards and Technology. CVE-2022-4543 Detail. https://nvd.nist.gov/vuln/detail/CVE-2022-4543, 2023. National Vulnerability Database. Published January 11, 2023; last modified April 8, 2025. Accessed May 9, 2026.

  35. [35]

    SECODEPLT: A unified benchmark for evaluating the security risks and capabilities of code genAI

    Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, and Dawn Song. SECODEPLT: A unified benchmark for evaluating the security risks and capabilities of code genAI. InNeurIPS Datasets and Benchmarks Track, 2025

  36. [36]

    Smashing the stack for fun and profit. Phrack Magazine, 1996

    Aleph One. Smashing the stack for fun and profit. Phrack Magazine, 1996

  37. [37]

    Frontier risk and preparedness

    OpenAI. Frontier risk and preparedness. https://openai.com/index/frontier-risk-a nd-preparedness. Accessed 2026-05

  38. [38]

    Introducing gpt-5.5

    OpenAI. Introducing gpt-5.5. https://openai.com/index/introducing-gpt-5-5. Accessed 2026-05

  39. [39]

    Trusted access for cyber, 2025

    OpenAI. Trusted access for cyber, 2025. Accessed: 2025-05-06

  40. [40]

    Openssl: Tls/ssl and crypto library

    OpenSSL. Openssl: Tls/ssl and crypto library. https://github.com/openssl/openssl. Accessed: 2025-09-15

  41. [41]

    Fuzzing JavaScript Engines with Aspect-Preserving Mutation

    Soyeon Park, Wen Xu, Insu Yun, Daehee Jang, and Taesoo Kim. Fuzzing JavaScript Engines with Aspect-Preserving Mutation. In IEEE Symposium on Security and Privacy (SP), 2020

  42. [42]

    PaX address space layout randomization (ASLR)

    PaX Team. PaX address space layout randomization (ASLR). https://pax.grsecurity.net/docs/aslr.txt, 2001

  43. [43]

    CVE-2016-8655: Linux AF_PACKET race condition local root exploit

    Philip Pettersson. CVE-2016-8655: Linux AF_PACKET race condition local root exploit. oss-security mailing list, December 2016. Exploit source: https://github.com/bcoles/ kernel-exploits/blob/master/CVE-2016-8655/chocobo_root.c

  44. [44]

    Evaluating frontier models for dangerous capabilities

    Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca ...

  45. [45]

    Return-oriented programming: Systems, languages, and applications

    Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. Return-oriented programming: Systems, languages, and applications. ACM Trans. Inf. Syst. Secur., 15(1):2:1–2:34, 2012

  46. [46]

    V8 sandbox escape via regexp bytecode modification

    rycbar77. V8 sandbox escape via regexp bytecode modification. GitHub, 2024. Technique used in V8CTF M122, M123, and PlaidCTF

  47. [47]

    kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels

    Sergej Schumilo, Cornelius Aschermann, Robert Gawlik, Sebastian Schinzel, and Thorsten Holz. kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels. In USENIX Security Symposium, 2017

  48. [48]

    The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86)

    Hovav Shacham. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In ACM Conference on Computer and Communications Security, 2007

  49. [49]

    Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems, 37:57472–57498, 2024

    Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, et al. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems, 37:57472–57498, 2024

  50. [50]

    Secrepobench: Benchmarking code agents for secure code completion in real-world repositories

    Chihao Shen, Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, and Yizheng Chen. Secrepobench: Benchmarking code agents for secure code completion in real-world repositories. arXiv preprint arXiv:2504.21205, 2025

  51. [51]

    Sok: (state of) the art of war: Offensive techniques in binary analysis

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, et al. Sok: (state of) the art of war: Offensive techniques in binary analysis. In 2016 IEEE symposium on security and privacy (SP), pages 138–157. IEEE, 2016

  52. [52]

    Sok: Eternal war in memory

    Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. Sok: Eternal war in memory. In IEEE Symposium on Security and Privacy, 2013

  53. [53]

    SoK: Eternal War in Memory

    Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. SoK: Eternal War in Memory. In IEEE Symposium on Security and Privacy, 2013

  54. [54]

    Chromium issue tracker

    The Chromium Project. Chromium issue tracker. https://issues.chromium.org/. Accessed: 2026-04-21

  55. [55]

    The Linux Kernel Organization

    The Linux Kernel Documentation. The kernel’s command-line parameters. The Linux Kernel Organization. Documentation for Linux kernel boot parameters, including kaslr and nokaslr

  56. [56]

    V8 JavaScript Engine

    V8 Project Authors. V8 JavaScript Engine. https://v8.dev/, 2026. Google’s open-source JavaScript and WebAssembly engine

  57. [57]

    How ai can strengthen digital security

    Phil Venables and Royal Hansen. How ai can strengthen digital security. https://blog.google/innovation-and-ai/technology/safety-security/google-ai-cyber-defense-initiative. Accessed 2026-05

  58. [58]

    Baxbench: Can llms generate correct and secure backends? In ICML, 2025

    Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanovic, Jingxuan He, and Martin Vechev. Baxbench: Can llms generate correct and secure backends? In ICML, 2025

  59. [59]

    syzkaller: An unsupervised coverage-guided kernel fuzzer. https://github.com/google/syzkaller

    Dmitry Vyukov and syzkaller contributors. syzkaller: An unsupervised coverage-guided kernel fuzzer. https://github.com/google/syzkaller. Accessed: 2026-04-21

  60. [60]

    SyzVegas: Beating Kernel Fuzzing Odds with Reinforcement Learning

    Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V Krishnamurthy, and Nael Abu-Ghazaleh. SyzVegas: Beating Kernel Fuzzing Odds with Reinforcement Learning. In USENIX Security Symposium, 2021

  61. [61]

    MAZE: Towards automated heap feng shui

    Yan Wang, Chao Zhang, Zixuan Zhao, Bolun Zhang, Xiaorui Gong, and Wei Zou. MAZE: Towards automated heap feng shui. In USENIX Security, 2021

  62. [62]

    Cybergym: Evaluating AI agents’ real-world cybersecurity capabilities at scale

    Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. Cybergym: Evaluating AI agents’ real-world cybersecurity capabilities at scale. In The Fourteenth International Conference on Learning Representations, 2026

  63. [63]

    PATCHAGENT: A practical program repair agent mimicking human expertise

    Zheng Yu, Ziyi Guo, Yuhang Wu, Jiahao Yu, Meng Xu, Dongliang Mu, Yan Chen, and Xinyu Xing. PATCHAGENT: A practical program repair agent mimicking human expertise. In USENIX Security, 2025

  64. [64]

    Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y

    Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit C...

  65. [65]

    Cybench: A framework for evaluating cybersecurity capabilities and risks of language models

    Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Julian Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Haoxiang Yang, Aolin Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Kenny O Oseleononmen, Dan Boneh, Da...

  66. [66]

    Fuzzing: A Survey for Roadmap

    Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. Fuzzing: A Survey for Roadmap. ACM Computing Surveys (CSUR), 54(11s), 2022

  67. [67]

    CVE-bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities

    Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities. In 42nd International Conference on Machine Le...
