Recognition: 2 Lean theorem links
ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?
Pith reviewed 2026-05-13 02:52 UTC · model grok-4.3
The pith
Given a vulnerability-triggering input, frontier AI models can produce working exploits in over 100 real-world cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ExploitGym is a benchmark of 898 real-world vulnerability instances in containerized environments. Given a triggering input, agents must produce a working exploit that delivers concrete security impact. Frontier models such as Claude Mythos Preview succeed on 157 instances and GPT-5.5 on 120 instances, retaining non-trivial success even when common defenses are enabled.
What carries the argument
ExploitGym benchmark, which supplies containerized instances and requires agents to extend vulnerability triggers into full exploits while varying applied protections.
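The "varying applied protections" machinery can be pictured as a small configuration matrix enumerated per instance. The protection names and on/off encoding below are assumptions for illustration, not ExploitGym's actual format:

```python
from itertools import product

# Hypothetical sketch of a protection matrix: the review says protections
# are varied per instance, but these flag names and this config format are
# illustrative assumptions, not the benchmark's API.
PROTECTIONS = {
    "aslr": [False, True],          # address-space layout randomization
    "stack_canary": [False, True],  # e.g. gcc -fstack-protector
    "nx": [False, True],            # non-executable stack/heap
}

def protection_configs():
    """Enumerate every combination of on/off protection settings."""
    keys = sorted(PROTECTIONS)
    for values in product(*(PROTECTIONS[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(protection_configs())
# 2^3 = 8 distinct defense configurations per vulnerability instance.
```

Running the same agent across every row of such a matrix is what lets the benchmark attribute success-rate drops to individual defenses rather than to instance difficulty.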
If this is right
- Frontier models achieve concrete security impact on a measurable fraction of tested vulnerabilities.
- Widely deployed defenses reduce but do not eliminate model success at exploitation.
- The benchmark supplies a reproducible way to compare and track agent exploitation performance over time.
- Results indicate that AI agents are approaching practical offensive utility in cybersecurity.
Where Pith is reading between the lines
- Further gains on this task could reduce the human effort required to weaponize newly discovered vulnerabilities.
- Dual-use findings may accelerate both automated defense tools and automated attack tooling in parallel.
- Persistent success under defenses suggests the need for new mitigation strategies that target AI reasoning patterns rather than traditional protections alone.
Load-bearing premise
That success when a triggering input is supplied in isolated containers reflects the skills needed for exploitation in uncontrolled real-world settings that demand additional setup and discovery.
What would settle it
An experiment in which the same models are given no pre-supplied trigger and must operate in non-containerized environments yet still fail to produce any working exploits.
read the original abstract
AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete security impact, such as unauthorized file access or code execution. Exploitation is a particularly challenging task because it requires low-level program reasoning (e.g., about memory layout), runtime adaptation, and sustained progress over long horizons. Meanwhile, it is inherently dual-use, supporting defensive workflows while lowering the barrier for offense. Despite its importance and diagnostic value, exploitation remains under-evaluated. To address this gap, we introduce ExploitGym, a large-scale, diverse, realistic benchmark on the exploitation capabilities of AI agents. Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit. The benchmark comprises 898 instances sourced from real-world vulnerabilities across three domains, including userspace programs, Google's V8 JavaScript engine, and the Linux kernel. We vary the security protections applied to each instance, isolating their impact on agent performance. All configurations are packaged in reproducible containerized environments. Our evaluation shows that while exploitation remains challenging, frontier models can successfully exploit a non-trivial fraction of vulnerabilities. For example, the strongest configurations are Anthropic's latest model Claude Mythos Preview and OpenAI's GPT-5.5, which produce working exploits for 157 and 120 instances, respectively. Notably, even with widely used defenses enabled, models retain non-trivial success rates. These results establish ExploitGym as an effective testbed for exploitation and highlight the growing cybersecurity risks posed by increasingly capable AI agents.
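For readers outside security, the trigger-versus-exploit gap the abstract describes can be sketched with a deliberately toy memory model. Everything below (the layout, the `is_admin` flag, the magic value) is invented for exposition; real ExploitGym instances involve native code, heap layouts, and active defenses:

```python
# Toy model of the trigger-vs-exploit distinction: an unchecked copy into
# an 8-byte buffer that sits directly before a one-byte flag.
MEM_SIZE = 16
BUF_LEN = 8        # 8-byte input buffer at offset 0
FLAG_OFF = 8       # one-byte is_admin flag sits right after the buffer
MAGIC = 0x42       # value the program treats as "admin"

def vulnerable_copy(mem: bytearray, data: bytes) -> None:
    # Bug: copies len(data) bytes with no bounds check against BUF_LEN.
    mem[0:len(data)] = data

def run(data: bytes) -> str:
    mem = bytearray(MEM_SIZE)
    vulnerable_copy(mem, data)
    if mem[FLAG_OFF] == MAGIC:
        return "impact: unauthorized admin access"
    if len(data) > BUF_LEN:
        return "crash: memory corrupted, no impact"
    return "ok"

# Trigger (what the benchmark supplies): proves the out-of-bounds write
# exists but delivers no security impact on its own.
trigger = b"A" * 9
# Exploit (what the agent must produce): extends the trigger so the
# overflowing byte lands on the flag with exactly the right value.
exploit = b"A" * 8 + bytes([MAGIC])
```

The benchmark's task is exactly this extension step, scaled up to memory-layout reasoning and long-horizon adaptation in real targets.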
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ExploitGym, a benchmark of 898 real-world vulnerability instances drawn from userspace programs, Google's V8 engine, and the Linux kernel. Agents receive a vulnerability-triggering input and are tasked with extending it into a working exploit inside reproducible containerized environments; security protections are systematically varied to isolate their effects. Evaluation results indicate that frontier models achieve non-trivial success, with Claude Mythos Preview succeeding on 157 instances and GPT-5.5 on 120 instances, and that success rates remain positive even when common defenses are enabled.
Significance. If the reported success rates are reproducible under the stated protocol, ExploitGym supplies a useful large-scale empirical testbed for measuring AI agents' low-level program reasoning and exploit-construction capabilities. The inclusion of multiple domains, explicit defense variations, and containerized reproducibility are strengths that allow controlled isolation of factors. The work therefore contributes a concrete artifact for tracking progress on a dual-use cybersecurity task.
major comments (2)
- Abstract and §3 (Benchmark Construction): the headline claim that models 'produce working exploits for 157 and 120 instances' and thereby demonstrate 'real-world exploitation capabilities' is not fully supported by the described protocol. The setup always supplies the triggering input and executes inside known, reproducible containers; this removes reconnaissance, trigger discovery, environment fingerprinting, and adaptation to unknown ASLR or external dependencies. The reported numbers therefore measure PoC extension under scaffolding rather than autonomous real-world exploitation, and the abstract does not quantify how much the success rates would drop without these prerequisites.
- §4 (Evaluation) and abstract: the manuscript reports specific success counts and notes the inclusion of defenses but provides no details on the exact success-verification criteria (e.g., what constitutes a 'working exploit'), the instance-selection process, or the prompting strategies used. These omissions leave moderate gaps in support for the central performance claims and make it difficult for readers to assess whether the 157/120 figures are robust or sensitive to minor protocol changes.
minor comments (2)
- §2 (Related Work): the discussion of prior AI-for-security benchmarks could more explicitly contrast ExploitGym's focus on exploit extension with existing fuzzing or vulnerability-discovery suites.
- Figure 1 and Table 2: axis labels and legend entries use inconsistent capitalization and abbreviation style; a uniform notation would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We agree that the manuscript requires greater precision in describing the benchmark's scope and have revised the abstract and relevant sections to address the concerns. Our point-by-point responses to the major comments are provided below.
read point-by-point responses
Referee: Abstract and §3 (Benchmark Construction): the headline claim that models 'produce working exploits for 157 and 120 instances' and thereby demonstrate 'real-world exploitation capabilities' is not fully supported by the described protocol. The setup always supplies the triggering input and executes inside known, reproducible containers; this removes reconnaissance, trigger discovery, environment fingerprinting, and adaptation to unknown ASLR or external dependencies. The reported numbers therefore measure PoC extension under scaffolding rather than autonomous real-world exploitation, and the abstract does not quantify how much the success rates would drop without these prerequisites.
Authors: We thank the referee for highlighting this distinction. The manuscript already states that agents are 'Given a program input that triggers a vulnerability' and tasked with extending it into a working exploit; the design intentionally isolates the exploitation phase to focus on low-level program reasoning and long-horizon adaptation, which remain challenging even with the trigger supplied. We do not claim to measure fully autonomous real-world attacks that include reconnaissance or environment discovery. In the revised manuscript we have updated the abstract to clarify that ExploitGym evaluates exploit generation from provided triggers within controlled, reproducible environments rather than claiming broad 'real-world exploitation capabilities'. We have also added an explicit limitations paragraph in §3 discussing the scaffolding provided. We cannot quantify the performance drop that would occur without the supplied trigger, as that would require a different experimental protocol outside the current benchmark design and is noted as future work.
revision: partial
Referee: §4 (Evaluation) and abstract: the manuscript reports specific success counts and notes the inclusion of defenses but provides no details on the exact success-verification criteria (e.g., what constitutes a 'working exploit'), the instance-selection process, or the prompting strategies used. These omissions leave moderate gaps in support for the central performance claims and make it difficult for readers to assess whether the 157/120 figures are robust or sensitive to minor protocol changes.
Authors: We agree that these details are essential for assessing robustness and reproducibility. In the revised manuscript we have expanded §4 with a new subsection on evaluation protocol. It now specifies the success-verification criteria (automated checks confirming the intended security impact, such as arbitrary code execution or unauthorized file access, via the container test harness), the instance-selection process (sourcing from public CVE databases, filtering for reproducible triggers and containerizable environments, and stratified sampling across the three domains), and the prompting strategies (including the base system prompt, chain-of-thought instructions, and handling of long-horizon interactions). These additions directly support the reported success counts and allow readers to evaluate sensitivity to protocol variations.
revision: yes
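The verification criterion the authors describe (automated checks confirming the intended impact via the container test harness) might look, in minimal form, like the sketch below. The function name, environment variable, and sentinel-file convention are hypothetical stand-ins; the review does not give the harness's actual interface:

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical verification sketch: "impact" is modelled as the exploit
# creating a file it should not be able to create. Real criteria (code
# execution, unauthorized file access) would be checked analogously.
def verify_exploit(exploit_code: str, timeout_s: float = 10.0) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        sentinel = Path(tmp) / "pwned"
        try:
            # Run the candidate exploit in a child process with a budget;
            # a containerized harness would add isolation around this.
            subprocess.run(
                [sys.executable, "-c", exploit_code],
                env={**os.environ, "SENTINEL": str(sentinel)},
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            return False  # no impact within the budget counts as failure
        return sentinel.exists()  # success = concrete, observable impact

# Illustrative candidates: one that achieves the impact, one that does not.
working_exploit = (
    "import os, pathlib; pathlib.Path(os.environ['SENTINEL']).touch()"
)
failing_exploit = "raise SystemExit(1)"
```

The point of such a check is that success is defined by an observable state change, not by the model's own claim of success, which also guards against reward hacking.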
Circularity Check
Empirical benchmark with no derivation chain or fitted predictions
full rationale
The paper introduces ExploitGym as a benchmark and reports empirical success rates from running frontier models on 898 containerized vulnerability instances. There are no equations, derivations, parameter fits, or predictions that reduce to inputs by construction. All results are generated by direct execution in reproducible containerized environments, and no self-citations, uniqueness theorems, or ansatzes carry any load. The evaluation protocol is self-contained against the provided benchmark data.
Axiom & Free-Parameter Ledger
invented entities (1)
- ExploitGym benchmark (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit... frontier models... produce working exploits for 157 and 120 instances"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We vary the security protections applied to each instance, isolating their impact on agent performance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.