Red-Teaming the Agentic Red-Team

Artem Sorokin; Dario Pasquini; Michal Bazyli; Taras Fedynyshyn

arxiv: 2606.24496 · v1 · pith:P5C3JZTDnew · submitted 2026-06-23 · 💻 cs.CR · cs.AI

Red-Teaming the Agentic Red-Team

Dario Pasquini , Michal Bazyli , Taras Fedynyshyn , Artem Sorokin This is my paper

Pith reviewed 2026-06-25 23:19 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords agentic systemsoffensive securityred teamingsecurity analysiscyber kill chainsandbox escapeAPI key exfiltrationdesign principles

0 comments

The pith

Agentic red-teaming tools share design flaws that let adversaries exfiltrate API keys and fully compromise the operator's machine despite sandboxing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first detailed security review of agentic systems built for offensive security work. It identifies shared weaknesses across the most common tools that allow an attacker to steal credentials, create persistent access, bypass safeguards, and break out of containers to control the host system. The authors map these weaknesses into a complete cyber kill chain that starts with manipulation of the language model and ends with sandbox escape. They then derive an improved overall architecture and a set of design rules meant to block the attack paths at the structural level.

Core claim

Most widely used agentic offensive-security tools contain common design flaws that permit an active adversary to exfiltrate API keys, establish persistent footholds, bypass guardrails, and escape sandboxed containers to achieve full compromise of the operator's machine; the authors support this by defining a dedicated cyber kill chain and by proposing an alternative architecture that removes the identified attack surfaces.

What carries the argument

A cyber kill chain for agentic offensive-security systems that sequences LLM manipulation, lateral movement, persistence, guardrail bypass, and sandbox escape.

If this is right

Sandbox containment alone does not protect the operator's machine from compromise by these tools.
API keys and other credentials must be isolated from the agent's runtime environment.
Guardrails can be bypassed as part of the documented attack progression.
A new architecture is required that addresses the full kill chain rather than individual stages.
The proposed design principles can be applied to reduce the attack surface at the architectural level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same category of flaws may appear in agentic systems used for tasks other than offensive security.
Operators of any LLM-driven automation that runs with elevated privileges face comparable risks if similar isolation patterns are used.
The kill-chain model could be adapted to analyze defensive or general-purpose agent deployments.
Testing frameworks that simulate the full chain would help validate whether proposed mitigations actually close the paths.

Load-bearing premise

The design flaws observed in the examined popular tools are representative of the broader category of agentic offensive-security systems.

What would settle it

An experiment in which one of the major agentic red-teaming tools is shown to block API-key exfiltration and container escape when subjected to the attack sequences described in the kill chain.

Figures

Figures reproduced from arXiv: 2606.24496 by Artem Sorokin, Dario Pasquini, Michal Bazyli, Taras Fedynyshyn.

**Figure 1.** Figure 1: Abstraction of the architecture of a agentic-redteam, and the parties involved in the threat model. such systems, based on the invariant of containment and least privilege, rather than on optimistic assumptions that AI-targeted attacks such as prompt injection can be prevented. 2 Preliminary This section provides the necessary background and outlines the setup of our study. 2.1 Agentic Offensive Security … view at source ↗

**Figure 2.** Figure 2: Kill chain for agentic-red-teams (first row), and intermediate adversarial objectives (second row). Dotted lines represent skipping connections conditionally enabled by the target architecture. adversarial objectives, spanning infrastructure abuse and secrets exfiltration. An overview of the kill chain is given in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The attacker stages a malicious payload on an [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Out-of-bounds write in pwcrypt. A missing combined bounds check allows a crafted vault’s metadata extension record to overflow metadata_buffer and overwrite the adjacent function-pointer table. When the parser subsequently invokes the per-version integrity hook, the corrupted pointer redirects execution to system, which receives the attacker-controlled KDF parameter string as its argument [PITH_FULL_IMAGE… view at source ↗

**Figure 5.** Figure 5: An attacker who already holds code execution on the [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Representation of the proposed secure architecture for [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

read the original abstract

The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws that enable an active adversary to exfiltrate API keys, establish persistent footholds, and fully compromise the operator's machine, even when the agent operates inside a sandboxed container. To support our analysis, we introduce a full cyber kill chain for such agentic systems, capturing the progression from initial LLM manipulation to lateral movement, persistence, guardrail bypass, and sandbox escape. Building on our security analysis, we derive a robust architecture for agentic offensive-security tools and propose actionable, broadly applicable design principles that mitigate the disclosed attack paths at the architectural level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags design flaws in agentic offensive-security tools and offers a kill chain plus architecture fixes, but provides no list of examined tools or methods to support the 'most widely used' claim.

read the letter

The main thing here is that the authors say most common agentic systems for red-teaming let an adversary steal API keys, plant persistence, and escape sandboxes, and they lay out a kill chain from LLM manipulation onward plus some design principles to block those paths.

What is new is the dedicated cyber kill chain tailored to these agentic setups and the move from analysis to concrete architecture recommendations. That framing is not in the earlier literature they cite. The paper also correctly notes that security of the tools themselves has received less attention than their offensive capabilities.

It does a service by drawing attention to a practical gap as these systems become commodity items. If the attacks are real and the fixes hold, the work could matter for anyone deploying or building autonomous security agents.

The soft spots are clear. The central claim rests on 'the most widely used' tools sharing the flaws, yet the abstract gives no names, no selection criteria, and no usage metrics. That leaves the generalization unsupported, exactly as the stress-test note flags. There is also no methodology section visible, so it is impossible to tell whether the attacks were demonstrated in practice or remain hypothetical. Soundness cannot be assessed from what is here.

This is for security researchers and developers who work with autonomous agents in offensive contexts. A reader looking for timely discussion of agent safety in red-teaming would get value from the framing, but anyone needing verifiable data or reproducible steps will find it thin.

I would send it to peer review. The topic is timely enough that referees could push for the missing tool list, selection details, and test methodology, and the authors could address those without starting over.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to deliver the first in-depth security analysis of the most widely used agentic systems for offensive security operations. It asserts that most such tools share design flaws enabling an active adversary to exfiltrate API keys, establish persistent footholds, and fully compromise the operator's machine even inside sandboxed containers. The work introduces a full cyber kill chain (LLM manipulation through lateral movement, persistence, guardrail bypass, and sandbox escape) and derives a robust architecture plus actionable design principles to mitigate the identified attack paths.

Significance. If the examined tools prove representative and the analysis is supported by concrete evidence, the result would be significant for the security of emerging agentic systems in offensive operations. It would supply the first structured kill-chain model for this domain and architectural-level mitigations, potentially influencing tool design and deployment practices.

major comments (1)

[Abstract] Abstract: The central claim that the work analyzes 'the most widely used agentic systems' and that 'most of these tools share common design flaws' is not supported by any enumeration of examined tools, selection criteria, or usage/popularity metrics. This absence directly undermines the generalization from the sampled systems to the category as a whole and is load-bearing for the paper's primary contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comment. We address the concern regarding support for the abstract's claims below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the work analyzes 'the most widely used agentic systems' and that 'most of these tools share common design flaws' is not supported by any enumeration of examined tools, selection criteria, or usage/popularity metrics. This absence directly undermines the generalization from the sampled systems to the category as a whole and is load-bearing for the paper's primary contribution.

Authors: We agree that the abstract's phrasing requires additional grounding to support the generalization. The full manuscript examines a set of representative agentic offensive-security tools, but the abstract does not enumerate them or detail selection criteria. In the revised version, we will expand the abstract (or add a brief supporting paragraph in the introduction) to list the specific tools analyzed, describe the selection criteria (prominence in public offensive-security repositories, community discussions, and documented usage in red-team operations), and reference available popularity indicators where they exist. This revision will make the scope and basis for the claims explicit without altering the core empirical findings or the derived kill chain and architecture. revision: yes

Circularity Check

0 steps flagged

Empirical security analysis of external tools; no circular derivation

full rationale

The paper performs a direct security analysis of existing agentic offensive-security systems, identifies design flaws through examination, introduces a kill chain based on observed attack paths, and derives mitigation principles from that analysis. The abstract and description contain no equations, fitted parameters, self-definitional constructs, or load-bearing self-citations that reduce any claim to the paper's own inputs by construction. Claims rest on external tool behavior rather than internal redefinition or renaming of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the work appears to rest on empirical observation of existing tools rather than new theoretical constructs.

pith-pipeline@v0.9.1-grok · 5707 in / 1153 out tokens · 31324 ms · 2026-06-25T23:19:11.594451+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 4 linked inside Pith

[1]

Ai agents may always fall for prompt injections.arXiv preprint arXiv:2605.17634, 2026

Sahar Abdelnabi and Eugene Bagdasarian. Ai agents may always fall for prompt injections.arXiv preprint arXiv:2605.17634, 2026

Pith/arXiv arXiv 2026
[2]

Ai pentesting (aipentest): Autonomous ai attack simulation

Aikido Security. Ai pentesting (aipentest): Autonomous ai attack simulation. https://www.aikido.dev/ attack/aipentest, 2026. Accessed: 2026-06-18

2026
[3]

Alias1: Cybersecurity ai and robot secu- rity platform

Alias Robotics. Alias1: Cybersecurity ai and robot secu- rity platform. https://aliasrobotics.com/alias1. php, 2026. Accessed: 2026-06-18

2026
[4]

Opencode: The open source coding agent,

Anomaly. Opencode: The open source coding agent,
[5]

Accessed: 2026-06-22

GitHub repository. Accessed: 2026-06-22

2026
[6]

Disrupting the first reported ai-orchestrated cyber espionage campaign

Anthropic. Disrupting the first reported ai-orchestrated cyber espionage campaign. https://www.anthropic. com/news/disrupting-AI-espionage, 2025. Pub- lished: November 2025. Accessed: 2026-06-18

2025
[7]

sandbox-runtime: A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the os level

Anthropic. sandbox-runtime: A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the os level. https://github. com/anthropic-experimental/sandbox-runtime,
[8]

GitHub repository, accessed 2026-06-15

2026
[9]

Deep reinforcement fuzzing

Konstantin Böttinger, Patrice Godefroid, and Rishabh Singh. Deep reinforcement fuzzing. In2018 IEEE Security and Privacy Workshops (SPW), pages 116–122. IEEE, 2018

2018
[10]

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, Kamalika Chaudhuri, Xiaohan Fu, Khawaja Shams, Guy Amir, Jihye Choi, Sarthak Choudhary, Nils Palumbo, Andrey Labunets, and Nishit V . Pandya. Agent security is a systems problem, 2026

2026
[11]

Cracken: Ai-enabled adversarial exposure val- idation platform

Cracken. Cracken: Ai-enabled adversarial exposure val- idation platform. https://cracken.ai/, 2026. Ac- cessed: 2026-06-18

2026
[12]

{PentestGPT}: Evaluating and harnessing large language models for automated penetration testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. {PentestGPT}: Evaluating and harnessing large language models for automated penetration testing. In33rd USENIX Security Sympo- sium (USENIX Security 24), pages 847–864, 2024

2024
[13]

Mem- ory injection attacks on llm agents via query-only in- teraction.Advances in Neural Information Processing Systems, 38:46697–46731, 2026

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Mem- ory injection attacks on llm agents via query-only in- teraction.Advances in Neural Information Processing Systems, 38:46697–46731, 2026

2026
[14]

Llm agents can autonomously exploit one-day vulnerabilities, 2024

Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities, 2024

2024
[15]

Llm agents can autonomously hack websites, 2024

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites, 2024. 18

2024
[16]

catastrophic failure

Fortune. Ai-powered coding tool replit wiped database, called it a "catastrophic failure".Fortune, July 2025

2025
[17]

Human-inspired episodic memory for infinite context LLMs

Zafeirios Fountas, Martin Benfeghoul, Adnan Oomer- jee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou Ammar, and Jun Wang. Human-inspired episodic memory for infinite context LLMs. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[18]

Ai vulnerability exploitation for initial access, 2026

Google Threat Intelligence Group. Ai vulnerability exploitation for initial access, 2026. Google Cloud Blog, accessed 2026-06-17

2026
[19]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

2023
[20]

Getting pwn’d by ai: Penetration testing with large language models

Andreas Happe and Jürgen Cito. Getting pwn’d by ai: Penetration testing with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foun- dations of Software Engineering, ESEC/FSE ’23. ACM, November 2023

2023
[21]

Freedberg Jr

Sydney J. Freedberg Jr. Army plans fast follow- up to ai cyber wargame with industry: Officials. https://breakingdefense.com/2026/05/army-plans- fast-follow-up-to-ai-cyber-wargame-with-industry- officials/, 2026. Breaking Defense, published May

2026
[22]

Accessed: 2026-06-18

2026
[23]

Del Rosario, and Timothy Ly- nar

Ben Kereopa-Yorke, Guillermo Diaz, Holly Wright, Rea- gan Johnston, Ron F. Del Rosario, and Timothy Ly- nar. Oracle poisoning: Corrupting knowledge graphs to weaponise ai agent reasoning, 2026

2026
[24]

Locked shields 2026: 41 nations strengthen cyber resilience in world’s biggest exercise, April 2026

Eduard Kovacs. Locked shields 2026: 41 nations strengthen cyber resilience in world’s biggest exercise, April 2026

2026
[25]

Aligned llms are not aligned browser agents

Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean Hendryx, Summer Yue, and Zifan Wang. Aligned llms are not aligned browser agents. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Repre- sentations, volume 2025, pages 26755–26776, 2025

2025
[26]

Takedown: How it’s done in modern coding agent exploits.arXiv preprint arXiv:2509.24240, 2025

Eunkyu Lee, Donghyeon Kim, Wonyoung Kim, and Insu Yun. Takedown: How it’s done in modern coding agent exploits.arXiv preprint arXiv:2509.24240, 2025

arXiv 2025
[27]

Comparing ai agents to cybersecurity professionals in real-world penetration testing.arXiv preprint arXiv:2512.09882, 2025

Justin W Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J Zico Kolter, et al. Comparing ai agents to cybersecurity professionals in real-world penetration testing.arXiv preprint arXiv:2512.09882, 2025

arXiv 2025
[28]

Synthe- sizing multi-agent harnesses for vulnerability discovery

Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthe- sizing multi-agent harnesses for vulnerability discovery. arXiv preprint arXiv:2604.20801, 2026

Pith/arXiv arXiv 2026
[29]

Fast, lean, and accurate: Modeling password guessability using neural networks

William Melicher, Blase Ur, Sean M Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lor- rie Faith Cranor. Fast, lean, and accurate: Modeling password guessability using neural networks. In25th USENIX Security Symposium (USENIX Security 16), pages 175–191, 2016

2016
[30]

Invitation is all you need! promptware attacks against llm-powered as- sistants in production are practical and dangerous.arXiv preprint arXiv:2508.12175, 2025

Ben Nassi, Stav Cohen, and Or Yair. Invitation is all you need! promptware attacks against llm-powered as- sistants in production are practical and dangerous.arXiv preprint arXiv:2508.12175, 2025

arXiv 2025
[31]

CVE-2022- 0185 Detail

National Vulnerability Database (NVD). CVE-2022- 0185 Detail. https://nvd.nist.gov/vuln/detail/ CVE-2022-0185, 2022. Accessed: 2026-06-08

2022
[32]

Novee security: Ai penetration test- ing platform

Novee Security. Novee security: Ai penetration test- ing platform. https://novee.security/, 2026. Ac- cessed: 2026-06-18

2026
[33]

Universal neural-cracking-machines: Self- configurable password models from auxiliary data

Dario Pasquini, Giuseppe Ateniese, and Carmela Tron- coso. Universal neural-cracking-machines: Self- configurable password models from auxiliary data. In 2024 IEEE Symposium on Security and Privacy (SP), pages 1365–1384. IEEE, 2024

2024
[34]

Improving pass- word guessing via representation learning

Dario Pasquini, Ankit Gangwal, Giuseppe Ateniese, Massimo Bernaschi, and Mauro Conti. Improving pass- word guessing via representation learning. In2021 IEEE Symposium on Security and Privacy (SP), pages 1382–1399. IEEE, 2021

2021
[35]

Kornaropoulos, and Giuseppe Ateniese

Dario Pasquini, Evgenios M. Kornaropoulos, and Giuseppe Ateniese. Hacking back the ai-hacker: Prompt injection as a defense against llm-driven cyberattacks, 2024

2024
[36]

Kornaropoulos, Giuseppe Ateniese, Omer Akgul, Athanasios Theocharis, and Pet- ros Efstathopoulos

Dario Pasquini, Evgenios M. Kornaropoulos, Giuseppe Ateniese, Omer Akgul, Athanasios Theocharis, and Pet- ros Efstathopoulos. When aiops become "ai oops": Sub- verting llm-driven it operations via telemetry manipula- tion, 2025. 19

2025
[37]

Neural exec: Learning (and learning from) execu- tion triggers for prompt injection attacks

Dario Pasquini, Martin Strohmeier, and Carmela Tron- coso. Neural exec: Learning (and learning from) execu- tion triggers for prompt injection attacks. InProceed- ings of the 2024 Workshop on Artificial Intelligence and Security, pages 89–100, 2024

2024
[38]

Army awards seekr contract for agentic ai to identify cyber vulnerabilities

Dan Schere. Army awards seekr contract for agentic ai to identify cyber vulnerabilities. https://insidedefense.com/insider/army-awards-seekr- contract-agentic-ai-identify-cyber-vulnerabilities, 2026. Inside Defense, published January 2026. Accessed: 2026-06-18

2026
[39]

An empirical evaluation of llms for solving offensive security challenges

Minghao Shao, Boyuan Chen, Sofija Jancheska, Bren- dan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, and Muhammad Shafique. An empirical evaluation of llms for solving offensive security challenges. 2024

2024
[40]

A survey on ma- chine learning techniques for cyber security in the last decade.IEEE access, 8:222310–222354, 2020

Kamran Shaukat, Suhuai Luo, Vijay Varadharajan, Ibrahim A Hameed, and Min Xu. A survey on ma- chine learning techniques for cyber security in the last decade.IEEE access, 8:222310–222354, 2020

2020
[41]

Cybergym-e2e: Scalable real-world benchmark for ai agents’ end-to-end cybersecurity capabilities.arXiv preprint arXiv:2606.04460, 2026

Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, et al. Cybergym-e2e: Scalable real-world benchmark for ai agents’ end-to-end cybersecurity capabilities.arXiv preprint arXiv:2606.04460, 2026

Pith/arXiv arXiv 2026
[42]

Incalmo: An au- tonomous llm-assisted system for red teaming multi- host networks.arXiv preprint arXiv:2501.16466, 2025

Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. Incalmo: An au- tonomous llm-assisted system for red teaming multi- host networks.arXiv preprint arXiv:2501.16466, 2025

arXiv 2025
[43]

Mariami Tkeshelashvili, Ritika Verma, and Steven M. Kelly. Ai loss of control risk: Indications & warning, February 2026

2026
[44]

From sands to mansions: Enabling auto- matic full-life-cycle cyberattack construction with llm

Lingzhi Wang, Jiahui Wang, Kyle Jung, Kedar Thia- garajan, Emily Wei, Xiangmin Shen, Yan Chen, and Zhenyuan Li. From sands to mansions: Enabling auto- matic full-life-cycle cyberattack construction with llm. arXiv preprint arXiv:2407.16928, 2024

arXiv 2024
[45]

Ex- ploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, et al. Ex- ploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Pith/arXiv arXiv 2026
[46]

Introducing the wiz red agent: Ai- powered attacker

Wiz. Introducing the wiz red agent: Ai- powered attacker. https://www.wiz.io/blog/ introducing-the-wiz-red-agent , 2026. Pub- lished: March 2026. Accessed: 2026-06-18

2026
[47]

Xbow: Autonomous offensive security plat- form

XBOW. Xbow: Autonomous offensive security plat- form. https://xbow.com/, 2026. Accessed: 2026-06- 18. Table A.1: List of guardrails implemented by the agents under consideration. Agent Binary allowlist Command denylist Shell-parse check Encoding decode Scope check Prompt-inj. detection AIRecon✓ CAI✓ ✓ DarkMoon✓ ✓ METATRON✓ PentestAgent✓ RedAmon✓ xalgorix✓ ✓

2026
[48]

Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li

Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li. Autoattacker: A large language model guided system to implement automatic cyber-attacks, 2024

2024
[49]

Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel Ho, and Percy Liang

Andy Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Wang, Junrong Wu, Kyleen Liao, Jil- iang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shi- vatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai- Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Ye, Prerit Choud-...

2025
[50]

Teams of llm agents can exploit zero-day vulnerabilities

Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of llm agents can exploit zero-day vulnerabilities. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 23–35, 2026

2026
[51]

{FuzzGuard}: Filter- ing out unreachable inputs in directed grey-box fuzzing through deep learning

Peiyuan Zong, Tao Lv, Dawei Wang, Zizhuang Deng, Ruigang Liang, and Kai Chen. {FuzzGuard}: Filter- ing out unreachable inputs in directed grey-box fuzzing through deep learning. In29th USENIX security sympo- sium (USENIX security 20), pages 2255–2269, 2020. Appendix A Additional material Table A.2 lists the source and version of the agents tested in the p...

2020
[52]

trusted con- tent

The invariant that is never checked is the combined one: subtype + value length ≤ 256. A record with subtype 255 and value length 49 satisfies all three explicit checks, yet the resulting write begins one byte before the buffer’s end and extends 48 bytes beyond it. BSS layout and function-pointer corruptionIn the linked binary, the 256-byte metadata buffe...

[1] [1]

Ai agents may always fall for prompt injections.arXiv preprint arXiv:2605.17634, 2026

Sahar Abdelnabi and Eugene Bagdasarian. Ai agents may always fall for prompt injections.arXiv preprint arXiv:2605.17634, 2026

Pith/arXiv arXiv 2026

[2] [2]

Ai pentesting (aipentest): Autonomous ai attack simulation

Aikido Security. Ai pentesting (aipentest): Autonomous ai attack simulation. https://www.aikido.dev/ attack/aipentest, 2026. Accessed: 2026-06-18

2026

[3] [3]

Alias1: Cybersecurity ai and robot secu- rity platform

Alias Robotics. Alias1: Cybersecurity ai and robot secu- rity platform. https://aliasrobotics.com/alias1. php, 2026. Accessed: 2026-06-18

2026

[4] [4]

Opencode: The open source coding agent,

Anomaly. Opencode: The open source coding agent,

[5] [5]

Accessed: 2026-06-22

GitHub repository. Accessed: 2026-06-22

2026

[6] [6]

Disrupting the first reported ai-orchestrated cyber espionage campaign

Anthropic. Disrupting the first reported ai-orchestrated cyber espionage campaign. https://www.anthropic. com/news/disrupting-AI-espionage, 2025. Pub- lished: November 2025. Accessed: 2026-06-18

2025

[7] [7]

sandbox-runtime: A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the os level

Anthropic. sandbox-runtime: A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the os level. https://github. com/anthropic-experimental/sandbox-runtime,

[8] [8]

GitHub repository, accessed 2026-06-15

2026

[9] [9]

Deep reinforcement fuzzing

Konstantin Böttinger, Patrice Godefroid, and Rishabh Singh. Deep reinforcement fuzzing. In2018 IEEE Security and Privacy Workshops (SPW), pages 116–122. IEEE, 2018

2018

[10] [10]

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, Kamalika Chaudhuri, Xiaohan Fu, Khawaja Shams, Guy Amir, Jihye Choi, Sarthak Choudhary, Nils Palumbo, Andrey Labunets, and Nishit V . Pandya. Agent security is a systems problem, 2026

2026

[11] [11]

Cracken: Ai-enabled adversarial exposure val- idation platform

Cracken. Cracken: Ai-enabled adversarial exposure val- idation platform. https://cracken.ai/, 2026. Ac- cessed: 2026-06-18

2026

[12] [12]

{PentestGPT}: Evaluating and harnessing large language models for automated penetration testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. {PentestGPT}: Evaluating and harnessing large language models for automated penetration testing. In33rd USENIX Security Sympo- sium (USENIX Security 24), pages 847–864, 2024

2024

[13] [13]

Mem- ory injection attacks on llm agents via query-only in- teraction.Advances in Neural Information Processing Systems, 38:46697–46731, 2026

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Mem- ory injection attacks on llm agents via query-only in- teraction.Advances in Neural Information Processing Systems, 38:46697–46731, 2026

2026

[14] [14]

Llm agents can autonomously exploit one-day vulnerabilities, 2024

Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities, 2024

2024

[15] [15]

Llm agents can autonomously hack websites, 2024

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites, 2024. 18

2024

[16] [16]

catastrophic failure

Fortune. Ai-powered coding tool replit wiped database, called it a "catastrophic failure".Fortune, July 2025

2025

[17] [17]

Human-inspired episodic memory for infinite context LLMs

Zafeirios Fountas, Martin Benfeghoul, Adnan Oomer- jee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou Ammar, and Jun Wang. Human-inspired episodic memory for infinite context LLMs. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[18] [18]

Ai vulnerability exploitation for initial access, 2026

Google Threat Intelligence Group. Ai vulnerability exploitation for initial access, 2026. Google Cloud Blog, accessed 2026-06-17

2026

[19] [19]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

2023

[20] [20]

Getting pwn’d by ai: Penetration testing with large language models

Andreas Happe and Jürgen Cito. Getting pwn’d by ai: Penetration testing with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foun- dations of Software Engineering, ESEC/FSE ’23. ACM, November 2023

2023

[21] [21]

Freedberg Jr

Sydney J. Freedberg Jr. Army plans fast follow- up to ai cyber wargame with industry: Officials. https://breakingdefense.com/2026/05/army-plans- fast-follow-up-to-ai-cyber-wargame-with-industry- officials/, 2026. Breaking Defense, published May

2026

[22] [22]

Accessed: 2026-06-18

2026

[23] [23]

Del Rosario, and Timothy Ly- nar

Ben Kereopa-Yorke, Guillermo Diaz, Holly Wright, Rea- gan Johnston, Ron F. Del Rosario, and Timothy Ly- nar. Oracle poisoning: Corrupting knowledge graphs to weaponise ai agent reasoning, 2026

2026

[24] [24]

Locked shields 2026: 41 nations strengthen cyber resilience in world’s biggest exercise, April 2026

Eduard Kovacs. Locked shields 2026: 41 nations strengthen cyber resilience in world’s biggest exercise, April 2026

2026

[25] [25]

Aligned llms are not aligned browser agents

Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean Hendryx, Summer Yue, and Zifan Wang. Aligned llms are not aligned browser agents. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Repre- sentations, volume 2025, pages 26755–26776, 2025

2025

[26] [26]

Takedown: How it’s done in modern coding agent exploits.arXiv preprint arXiv:2509.24240, 2025

Eunkyu Lee, Donghyeon Kim, Wonyoung Kim, and Insu Yun. Takedown: How it’s done in modern coding agent exploits.arXiv preprint arXiv:2509.24240, 2025

arXiv 2025

[27] [27]

Comparing ai agents to cybersecurity professionals in real-world penetration testing.arXiv preprint arXiv:2512.09882, 2025

Justin W Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J Zico Kolter, et al. Comparing ai agents to cybersecurity professionals in real-world penetration testing.arXiv preprint arXiv:2512.09882, 2025

arXiv 2025

[28] [28]

Synthe- sizing multi-agent harnesses for vulnerability discovery

Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthe- sizing multi-agent harnesses for vulnerability discovery. arXiv preprint arXiv:2604.20801, 2026

Pith/arXiv arXiv 2026

[29] [29]

Fast, lean, and accurate: Modeling password guessability using neural networks

William Melicher, Blase Ur, Sean M Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lor- rie Faith Cranor. Fast, lean, and accurate: Modeling password guessability using neural networks. In25th USENIX Security Symposium (USENIX Security 16), pages 175–191, 2016

2016

[30] [30]

Invitation is all you need! promptware attacks against llm-powered as- sistants in production are practical and dangerous.arXiv preprint arXiv:2508.12175, 2025

Ben Nassi, Stav Cohen, and Or Yair. Invitation is all you need! promptware attacks against llm-powered as- sistants in production are practical and dangerous.arXiv preprint arXiv:2508.12175, 2025

arXiv 2025

[31] [31]

CVE-2022- 0185 Detail

National Vulnerability Database (NVD). CVE-2022- 0185 Detail. https://nvd.nist.gov/vuln/detail/ CVE-2022-0185, 2022. Accessed: 2026-06-08

2022

[32] [32]

Novee security: Ai penetration test- ing platform

Novee Security. Novee security: Ai penetration test- ing platform. https://novee.security/, 2026. Ac- cessed: 2026-06-18

2026

[33] [33]

Universal neural-cracking-machines: Self- configurable password models from auxiliary data

Dario Pasquini, Giuseppe Ateniese, and Carmela Tron- coso. Universal neural-cracking-machines: Self- configurable password models from auxiliary data. In 2024 IEEE Symposium on Security and Privacy (SP), pages 1365–1384. IEEE, 2024

2024

[34] [34]

Improving pass- word guessing via representation learning

Dario Pasquini, Ankit Gangwal, Giuseppe Ateniese, Massimo Bernaschi, and Mauro Conti. Improving pass- word guessing via representation learning. In2021 IEEE Symposium on Security and Privacy (SP), pages 1382–1399. IEEE, 2021

2021

[35] [35]

Kornaropoulos, and Giuseppe Ateniese

Dario Pasquini, Evgenios M. Kornaropoulos, and Giuseppe Ateniese. Hacking back the ai-hacker: Prompt injection as a defense against llm-driven cyberattacks, 2024

2024

[36] [36]

Kornaropoulos, Giuseppe Ateniese, Omer Akgul, Athanasios Theocharis, and Pet- ros Efstathopoulos

Dario Pasquini, Evgenios M. Kornaropoulos, Giuseppe Ateniese, Omer Akgul, Athanasios Theocharis, and Pet- ros Efstathopoulos. When aiops become "ai oops": Sub- verting llm-driven it operations via telemetry manipula- tion, 2025. 19

2025

[37] [37]

Neural exec: Learning (and learning from) execu- tion triggers for prompt injection attacks

Dario Pasquini, Martin Strohmeier, and Carmela Tron- coso. Neural exec: Learning (and learning from) execu- tion triggers for prompt injection attacks. InProceed- ings of the 2024 Workshop on Artificial Intelligence and Security, pages 89–100, 2024

2024

[38] [38]

Army awards seekr contract for agentic ai to identify cyber vulnerabilities

Dan Schere. Army awards seekr contract for agentic ai to identify cyber vulnerabilities. https://insidedefense.com/insider/army-awards-seekr- contract-agentic-ai-identify-cyber-vulnerabilities, 2026. Inside Defense, published January 2026. Accessed: 2026-06-18

2026

[39] [39]

An empirical evaluation of llms for solving offensive security challenges

Minghao Shao, Boyuan Chen, Sofija Jancheska, Bren- dan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, and Muhammad Shafique. An empirical evaluation of llms for solving offensive security challenges. 2024

2024

[40] [40]

A survey on ma- chine learning techniques for cyber security in the last decade.IEEE access, 8:222310–222354, 2020

Kamran Shaukat, Suhuai Luo, Vijay Varadharajan, Ibrahim A Hameed, and Min Xu. A survey on ma- chine learning techniques for cyber security in the last decade.IEEE access, 8:222310–222354, 2020

2020

[41] [41]

Cybergym-e2e: Scalable real-world benchmark for ai agents’ end-to-end cybersecurity capabilities.arXiv preprint arXiv:2606.04460, 2026

Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, et al. Cybergym-e2e: Scalable real-world benchmark for ai agents’ end-to-end cybersecurity capabilities.arXiv preprint arXiv:2606.04460, 2026

Pith/arXiv arXiv 2026

[42] [42]

Incalmo: An au- tonomous llm-assisted system for red teaming multi- host networks.arXiv preprint arXiv:2501.16466, 2025

Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. Incalmo: An au- tonomous llm-assisted system for red teaming multi- host networks.arXiv preprint arXiv:2501.16466, 2025

arXiv 2025

[43] [43]

Mariami Tkeshelashvili, Ritika Verma, and Steven M. Kelly. Ai loss of control risk: Indications & warning, February 2026

2026

[44] [44]

From sands to mansions: Enabling auto- matic full-life-cycle cyberattack construction with llm

Lingzhi Wang, Jiahui Wang, Kyle Jung, Kedar Thia- garajan, Emily Wei, Xiangmin Shen, Yan Chen, and Zhenyuan Li. From sands to mansions: Enabling auto- matic full-life-cycle cyberattack construction with llm. arXiv preprint arXiv:2407.16928, 2024

arXiv 2024

[45] [45]

Ex- ploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, et al. Ex- ploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Pith/arXiv arXiv 2026

[46] [46]

Introducing the wiz red agent: Ai- powered attacker

Wiz. Introducing the wiz red agent: Ai- powered attacker. https://www.wiz.io/blog/ introducing-the-wiz-red-agent , 2026. Pub- lished: March 2026. Accessed: 2026-06-18

2026

[47] [47]

Xbow: Autonomous offensive security plat- form

XBOW. Xbow: Autonomous offensive security plat- form. https://xbow.com/, 2026. Accessed: 2026-06- 18. Table A.1: List of guardrails implemented by the agents under consideration. Agent Binary allowlist Command denylist Shell-parse check Encoding decode Scope check Prompt-inj. detection AIRecon✓ CAI✓ ✓ DarkMoon✓ ✓ METATRON✓ PentestAgent✓ RedAmon✓ xalgorix✓ ✓

2026

[48] [48]

Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li

Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li. Autoattacker: A large language model guided system to implement automatic cyber-attacks, 2024

2024

[49] [49]

Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel Ho, and Percy Liang

Andy Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Wang, Junrong Wu, Kyleen Liao, Jil- iang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shi- vatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai- Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Ye, Prerit Choud-...

2025

[50] [50]

Teams of llm agents can exploit zero-day vulnerabilities

Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of llm agents can exploit zero-day vulnerabilities. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 23–35, 2026

2026

[51] [51]

{FuzzGuard}: Filter- ing out unreachable inputs in directed grey-box fuzzing through deep learning

Peiyuan Zong, Tao Lv, Dawei Wang, Zizhuang Deng, Ruigang Liang, and Kai Chen. {FuzzGuard}: Filter- ing out unreachable inputs in directed grey-box fuzzing through deep learning. In29th USENIX security sympo- sium (USENIX security 20), pages 2255–2269, 2020. Appendix A Additional material Table A.2 lists the source and version of the agents tested in the p...

2020

[52] [52]

trusted con- tent

The invariant that is never checked is the combined one: subtype + value length ≤ 256. A record with subtype 255 and value length 49 satisfies all three explicit checks, yet the resulting write begins one byte before the buffer’s end and extends 48 bytes beyond it. BSS layout and function-pointer corruptionIn the linked binary, the 256-byte metadata buffe...