LLM Agents can Autonomously Exploit One-day Vulnerabilities
Pith reviewed 2026-05-18 04:13 UTC · model grok-4.3
The pith
GPT-4 agents autonomously exploit 87 percent of tested one-day vulnerabilities when given their CVE descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When given the CVE description, a GPT-4 agent autonomously exploits 87 percent of the 15 one-day vulnerabilities, while GPT-3.5, open-source LLMs, ZAP, and Metasploit exploit 0 percent. The same GPT-4 agent exploits only 7 percent when the CVE description is withheld.
What carries the argument
An LLM agent that receives a CVE description and uses tools to probe and modify a target system in order to trigger the described vulnerability.
If this is right
- Malicious actors could use similar agents to automate attacks on recently disclosed but unpatched systems.
- Organizations running internet-facing software would need faster patching cycles than current norms.
- Access controls on tool-using LLM agents become a direct security requirement rather than an optional safeguard.
- The agent's dependence on CVE descriptions limits the risk today but does not remove it for future agent versions.
Where Pith is reading between the lines
- Defenders might need new monitoring that flags unusual sequences of system calls or web requests generated by automated agents.
- Vulnerability disclosure processes could face pressure to delay public CVE text if agents improve at using it.
- Testing regimes for new LLM agents should include standardized one-day exploitation benchmarks before public release.
Load-bearing premise
The 15 chosen vulnerabilities and the exact agent tools and prompts used are representative of real-world one-day vulnerabilities and typical LLM agent deployments.
What would settle it
Running the same GPT-4 agent setup on a larger, independently chosen set of one-day vulnerabilities and measuring whether the 87 percent exploitation rate holds or whether success without the CVE description rises substantially above 7 percent.
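One way to frame that check, as a rough sketch rather than the paper's method: treat the reported rates as fixed reference values (13 of 15 and 1 of 15, the only counts consistent with 87 and 7 percent of 15 items) and ask how surprising a replication's counts would be under them. The replication numbers below (`n_new`, `with_cve_successes`, `without_cve_successes`) are hypothetical placeholders, not results from any study.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


# Reference rates inferred from the reported percentages on 15 items.
REPORTED_WITH_CVE = 13 / 15
REPORTED_WITHOUT_CVE = 1 / 15

# HYPOTHETICAL replication counts, for illustration only.
n_new = 60
with_cve_successes = 40
without_cve_successes = 12

# Small value: the with-description rate looks lower than reported.
p_rate_dropped = lower_tail(with_cve_successes, n_new, REPORTED_WITH_CVE)
# Small value: the without-description rate looks higher than reported.
p_rate_rose = upper_tail(without_cve_successes, n_new, REPORTED_WITHOUT_CVE)

print(f"P(<= {with_cve_successes}/{n_new} | reported rate) = {p_rate_dropped:.2e}")
print(f"P(>= {without_cve_successes}/{n_new} | reported rate) = {p_rate_rose:.2e}")
```

Treating the original rates as exact ignores the uncertainty of a 15-item sample, so small tail probabilities here are suggestive rather than conclusive.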
Original abstract
LLMs have becoming increasingly powerful, both in their benign and malicious uses. With the increase in capabilities, researchers have been increasingly interested in their ability to exploit cybersecurity vulnerabilities. In particular, recent work has conducted preliminary studies on the ability of LLM agents to autonomously hack websites. However, these studies are limited to simple vulnerabilities. In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. It collects a dataset of 15 one-day vulnerabilities (including critical-severity ones) and reports that a GPT-4 agent, when given the CVE description, exploits 87% of them—versus 0% for GPT-3.5, open-source LLMs, and scanners ZAP/Metasploit—while GPT-4 without the CVE description succeeds on only 7%.
Significance. If the empirical results hold under more rigorous controls, the work is significant for demonstrating a concrete performance gap between frontier LLMs and prior systems on autonomous exploitation of real CVEs. It supplies falsifiable, measurable evidence on a specific task and raises timely questions about safe deployment of LLM agents in security-sensitive settings.
major comments (3)
- [§3] §3 (Dataset construction): No explicit selection criteria, diversity metrics (vuln type, software, severity distribution, or exploit complexity), or confirmation that test environments match production deployments are provided. This directly undermines the general claim that the 87% rate reflects a property of one-day vulnerabilities rather than a curated sample.
- [§4] §4 (Agent architecture and evaluation): The manuscript gives insufficient detail on exact agent architecture, tool access, prompting strategy, success criteria, number of trials, or controls for prompt-engineering variations. Without these, the 87% figure cannot be assessed for robustness or reproducibility.
- [§5] Baseline comparison (throughout §5): It is unclear whether ZAP and Metasploit were supplied equivalent CVE descriptions or run under the same environmental constraints as the LLM agent; the 0% result may therefore not constitute a fair head-to-head evaluation.
minor comments (2)
- [Abstract] Abstract: 'LLMs have becoming increasingly powerful' contains a grammatical error and should read 'have become'.
- [§5] Results lack error bars, confidence intervals, or multiple-run statistics for the reported success rates.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. The comments highlight important areas for improving the description of our methodology and evaluation. We respond to each major comment in turn and commit to revisions that address the concerns raised.
Point-by-point responses
- Referee: §3 (Dataset construction): No explicit selection criteria, diversity metrics (vuln type, software, severity distribution, or exploit complexity), or confirmation that test environments match production deployments are provided. This directly undermines the general claim that the 87% rate reflects a property of one-day vulnerabilities rather than a curated sample.
  Authors: We agree with the referee that the manuscript would benefit from more explicit details on how the dataset was constructed. In the revised manuscript, we will add to §3 a description of the selection criteria, which focused on vulnerabilities disclosed within the last year that have public CVE descriptions and affect commonly used software. We selected a mix of vulnerability types including remote code execution, SQL injection, and cross-site scripting to ensure diversity. We will also provide metrics on the distribution of severity levels and exploit complexity. Additionally, we will confirm that the test environments were set up to match production deployments using standard configurations from the software vendors. These changes will better support the generalizability of our 87% success rate to one-day vulnerabilities in real-world systems.
  Revision: yes
- Referee: §4 (Agent architecture and evaluation): The manuscript gives insufficient detail on exact agent architecture, tool access, prompting strategy, success criteria, number of trials, or controls for prompt-engineering variations. Without these, the 87% figure cannot be assessed for robustness or reproducibility.
  Authors: We recognize that the current description in §4 is insufficient for full reproducibility. We will revise this section to provide comprehensive details on the agent architecture, including the specific tools available to the agent such as command execution and browsing capabilities. The prompting strategy will be described in full, including how the CVE description is integrated into the agent's instructions. We will define the success criteria clearly and report the number of trials conducted per vulnerability along with any measures taken to control for variations in prompting. These additions will allow readers to better evaluate the robustness of the 87% success rate.
  Revision: yes
- Referee: Baseline comparison (throughout §5): It is unclear whether ZAP and Metasploit were supplied equivalent CVE descriptions or run under the same environmental constraints as the LLM agent; the 0% result may therefore not constitute a fair head-to-head evaluation.
  Authors: We thank the referee for pointing out this potential ambiguity in the baseline evaluation. The ZAP and Metasploit baselines were run in the exact same test environments as the LLM agent, with the vulnerable services deployed identically. However, these tools do not accept CVE descriptions as direct input; they rely on their internal vulnerability databases and scanning logic. In the revised manuscript, we will clarify this in §5 by detailing the configuration parameters used for each tool (e.g., ZAP's active scan on the target URL with specific policy settings, and Metasploit's use of relevant exploit modules matched to the CVE). We will also add text explaining that while the comparison is not identical in input format, it demonstrates the LLM agent's ability to leverage the CVE information effectively where traditional tools fail. This addresses the fairness concern while acknowledging the methodological differences.
  Revision: partial
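The responses above commit to reporting the number of trials per vulnerability and a clear success criterion. As a minimal sketch of why that matters, the snippet below shows how the headline rate depends on the counting rule when each vulnerability is attempted several times. The vulnerability identifiers and outcomes are hypothetical placeholders, and neither counting rule is claimed to be the one the paper used.

```python
from typing import Dict, List


def summarize_trials(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize repeated exploitation attempts under two counting rules.

    outcomes maps a vulnerability id to its per-trial results
    (True = the success criterion was met on that trial).
    """
    if not outcomes or any(len(trials) == 0 for trials in outcomes.values()):
        raise ValueError("every vulnerability needs at least one trial")
    # Rule 1: a vulnerability counts as exploited if any trial succeeded.
    solved_once = sum(1 for trials in outcomes.values() if any(trials))
    # Rule 2: average the per-trial success rate across vulnerabilities.
    per_trial = [sum(trials) / len(trials) for trials in outcomes.values()]
    return {
        "solved_at_least_once": solved_once / len(outcomes),
        "mean_per_trial": sum(per_trial) / len(per_trial),
    }


# HYPOTHETICAL outcomes, for illustration only (not from the paper's dataset).
hypothetical_outcomes = {
    "vuln-A": [True, True, False, True, True],
    "vuln-B": [False, False, False, False, False],
    "vuln-C": [True, False, False, False, False],
}

summary = summarize_trials(hypothetical_outcomes)
print(summary)
```

In this made-up example the "at least once" rule reports two of three vulnerabilities exploited, while the per-trial average is one third, which is why the number of trials and the rule need to be stated alongside any single percentage.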
Circularity Check
No circularity: straightforward empirical comparison
Full rationale
The paper reports direct experimental results from testing LLM agents on a fixed set of 15 one-day vulnerabilities, measuring success rates when provided CVE descriptions versus baselines (other models and scanners). No derivations, equations, fitted parameters, predictions, or self-citation chains are present that reduce any claim to its inputs by construction. The 87% figure is a measured outcome against external systems, not a tautology or renamed fit. The study is self-contained against its chosen benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Providing the official CVE description is a valid test of autonomous exploitation capability.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We show that LLM agents can autonomously exploit one-day vulnerabilities... GPT-4 is capable of exploiting 87%... ReAct agent framework... 91 lines of code
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
  APIOT is the first LLM framework to complete the full autonomous discovery-to-remediation cycle on bare-metal OT devices, reaching 90% success across 290 runs on Zephyr RTOS.
- CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
  LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
- Agentic Vulnerability Reasoning on Windows COM Binaries
  SLYP agentic pipeline discovers race condition vulnerabilities in Windows COM binaries and generates debugger-verified PoCs, scoring 0.973 F1 on a 40-case benchmark and finding 28 new confirmed vulnerabilities in prod...
- PHANTOM: Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry
  PHANTOM raises honeytoken believability from 0.576 to 0.778 by adding organization-specific mimicry, lifting human acceptance to 100% and detection resistance to 0.870.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
  LLMVD.js uses LLM agents to confirm 84% of taint-style vulnerabilities on public benchmarks (vs. <22% for prior tools) and generates validated exploits for 36 of 260 new packages (vs. ≤2 for traditional tools).
- SoK: Honeypots & LLMs, More Than the Sum of Their Parts?
  A systematization of knowledge paper that taxonomizes honeypot detection vectors, synthesizes LLM-honeypot literature into canonical architecture and evaluation methods, and proposes a roadmap for autonomous deception...
- Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches
  An agentic pipeline localizes the security-relevant function in 10 of 20 Ubuntu binary security updates and produces an accepted root-cause classification in 11 of 20, limited mainly by binary differencing coverage.
- Towards Optimal Agentic Architectures for Offensive Security Tasks
  Empirical comparison of agentic topologies for offensive security shows MAS-Indep reaching 64.2% validated detection while simpler baselines remain competitive on efficiency, with whitebox and web targets outperformin...
- An Independent Safety Evaluation of Kimi K2.5
  Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software
  RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
- A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection
  Vulnsage, a multi-agent framework, generates 34.64% more exploits than prior tools and verified 146 zero-day vulnerabilities in real-world open-source libraries.
- xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models
  xOffense automates penetration testing via a fine-tuned Qwen3-32B LLM in a multi-agent setup with specialized agents for reconnaissance, vulnerability scanning, and exploitation, reporting 79.17% sub-task completion o...
- Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
  The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...
- Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand
  Agentic AI lowers the cost and speed of cyber attacks, requiring immediate improvements in identity management, phishing-resistant authentication, patching, and agent governance for large enterprises and the Mittelstand.
- CyberAId: AI-Driven Cybersecurity for Financial Service Providers
  CyberAId is a proposed on-premise multi-agent system that coordinates LLM subagents with classical security tools to improve threat response and regulatory alignment in financial services.
- Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
  A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
- Large Language Model-Based Agents for Software Engineering: A Survey
  A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.