pith. machine review for the scientific record.

arxiv: 2605.11891 · v1 · submitted 2026-05-12 · 💻 cs.CR · cs.AI

Recognition: 3 Lean theorem links

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:26 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords agent skills · red teaming · adaptive attacks · LLM agents · skill vetting · cybersecurity · LLM security · self-evolving systems

The pith

Current vetting of agent skills underestimates residual risk from adaptive attackers who iteratively refine malicious skills using audit and runtime feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agent skills extend LLM agents with reusable instructions, tools, and code that users install from third-party sources, exposing both executable behavior and documentation that single-shot audits cannot fully secure. The paper frames the core risk as adaptive leakage, in which a budgeted attacker repeatedly rewrites a skill based on audit findings and runtime evidence until it evades detection while still producing verified harm. Proteus implements this by searching a formalized five-axis skill-attack space through a unified grey-box pipeline that returns structured feedback to drive mutations, path expansion, and surface expansion. The evaluation shows 40-90 percent attack success within five rounds and hundreds of jointly bypassing lethal variants, demonstrating that static vetting leaves substantial unaddressed exposure in agent skill ecosystems.

Core claim

Proteus is a grey-box, self-evolving red-team framework that measures adaptive leakage. It searches a formalized five-axis skill-attack space, evaluates each candidate through an audit-sandbox-oracle pipeline, and applies cross-round mutations plus path and surface expansion to generate alternative implementations and transfer attack patterns to new objectives. It reaches 40-90 percent ASR@5 with positive learning curves and produces 438 jointly bypassing lethal variants, bypassing SkillVetter at over 93 percent and achieving up to 41.3 percent joint success against AI-Infra-Guard.

What carries the argument

The adaptive leakage definition together with Proteus's grey-box self-evolving pipeline, which unifies structured audit findings and runtime evidence to guide iterative mutation across a five-axis attack space.
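As an illustration of the round contract this pipeline implies, the sketch below models one adaptive-leakage search loop. It is a toy under stated assumptions: the `audit`, `sandbox_and_oracle`, and `mutate` heuristics and all token names are invented here, not taken from Proteus.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Toy stand-in for the paper's feedback quadruple <b, l, F, R>:
    audit verdict b, lethality flag l, structured findings F, runtime evidence R."""
    approved: bool
    lethal: bool
    findings: list
    runtime_evidence: list

def audit(skill: str):
    """Hypothetical auditor: flags an obvious marker token."""
    flagged = [tok for tok in skill.split() if tok == "exfiltrate"]
    return len(flagged) == 0, flagged

def sandbox_and_oracle(skill: str):
    """Hypothetical runtime + oracle: 'verified harm' if a payload token survives."""
    evidence = [tok for tok in skill.split() if tok.startswith("payload")]
    return len(evidence) > 0, evidence

def mutate(skill: str, fb: Feedback) -> str:
    """Hypothetical Reason->Mutate operator: rewrite flagged tokens, keep the payload."""
    return " ".join("fetch" if w in fb.findings else w for w in skill.split())

def red_team_loop(seed_skill: str, budget: int = 5):
    """Grey-box round loop: evaluate each candidate through audit, sandbox, and
    oracle; on failure, feed the quadruple back into the next mutation."""
    skill = seed_skill
    for t in range(1, budget + 1):
        approved, findings = audit(skill)
        lethal, evidence = sandbox_and_oracle(skill)
        fb = Feedback(approved, lethal, findings, evidence)
        if fb.approved and fb.lethal:
            return t, skill          # joint success: audit bypass + verified harm
        skill = mutate(skill, fb)
    return None, skill               # budget exhausted without joint success

round_won, final = red_team_loop("exfiltrate payload_logs to remote")
```

In this toy run the seed fails audit in round 1, is rewritten using the returned findings, and reaches joint success in round 2 — the feedback-driven behavior the paper measures, reduced to a single heuristic.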

If this is right

  • Skill vetting must shift from single-shot audits to repeated feedback-driven testing to capture residual risk.
  • Auditors such as AI-Infra-Guard still permit up to 41.3 percent joint bypass and harm after path and surface expansion.
  • Successful attack implementations can be diversified through path expansion and transferred to new objectives via surface expansion.
  • Attack success rates increase over rounds with positive learning-curve slopes on both evaluated auditors.
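The round-level metrics in these bullets can be made concrete with a short sketch. The per-seed data below is invented for illustration, and the slope is an ordinary least-squares fit, which may differ from the paper's learning-curve estimator.

```python
# Per-seed "round of first joint success" (None = never succeeded within budget).
# Ten invented seeds, budget T = 5.
first_success_round = [1, 2, 2, 3, None, 4, 5, None, 3, 2]

def asr_at(t, rounds):
    """ASR@t: fraction of seeds whose first joint success came at round <= t."""
    return sum(1 for r in rounds if r is not None and r <= t) / len(rounds)

T = 5
curve = [asr_at(t, first_success_round) for t in range(1, T + 1)]

# Least-squares slope of ASR@t against t; a positive slope is the
# "positive learning curve" reported for both auditors.
xs = list(range(1, T + 1))
mx = sum(xs) / T
my = sum(curve) / T
slope = sum((x - mx) * (y - my) for x, y in zip(xs, curve)) / sum(
    (x - mx) ** 2 for x in xs
)
```

With this invented data the curve rises from 0.1 to 0.8 across five rounds, giving a positive slope; ASR@5 is simply the curve's final point.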

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Skill marketplaces and repositories may need continuous or runtime monitoring mechanisms in addition to initial approval.
  • Agent platforms could incorporate defenses that detect and block evolved attack patterns during execution rather than relying solely on pre-deployment vetting.
  • The five-axis model could be tested for completeness by applying Proteus to skills from additional sources beyond the evaluated cells.

Load-bearing premise

The formalized five-axis skill-attack space and the grey-box mutation strategies adequately represent the capabilities and strategies of realistic adaptive attackers in deployed systems.

What would settle it

Direct comparison of success rates when human or automated attackers use iterative revision against the same auditors in live agent skill deployments versus the rates reported for Proteus.

Figures

Figures reproduced from arXiv: 2605.11891 by Zhaojiacheng Zhou.

Figure 1. PROTEUS overview: graph-guided chain composition (top) feeds the red-team mutator agent (bottom-left), which interacts with the defender-agnostic grey-box round contract (bottom-right) via a feedback quadruple ⟨b, ℓ, F, R⟩. Each round, the red team applies a two-stage Reason→Mutate operator (Appendix B.1); the auditor, runtime, and oracle return a feedback quadruple.
Figure 2. PyRIT standalone vs. PROTEUS on AIG + DeepSeek-V4-Pro + GLM-5. Microsoft's PyRIT [23] serves as a published external red team alongside the internal Random / Zero-shot / Blackbox baselines; on the same cell (20 seeds, T=5), PyRIT in its default narrative-rewrite mode mutates only the documentation channel d and yields 5/100 joint-…
Figure 3. ASR@t trajectory per (mutator, target) cell; SV (top), AIG (bottom).
Figure 4. Round of first ⋆ per cell.
Figure 5. Surface expansion: (a) ⋆-winners by 7-class objective; (b) S2 vs. S3 yield per cell. Row codes: S=SV, A=AIG; G=GPT-5.4-mini, L=GLM-5; D=DeepSeek-V4-Pro, K=Kimi-2.6.
Figure 6. A real round-by-round trajectory across the three search modes on a single privilege-escalation seed: S1 free discovery in the top row (R1 audit-fails → R2 rewrites narrative → R3 pivots primitive and lethalises ⋆), S2 path expansion in the bottom-left (alternative implementations of the same objective), and S3 surface expansion in the bottom-right (transferring the shared file-write motif to a diff…
Figure 7. Per-seed ⋆ coverage across the 8 configurations.
Figure 8. Per-(cell, attack-category) ⋆-rate heatmap; rows = (defender, target, mutator), columns = 7 categories.
Figure 9. Ablation ⋆-rate on the AIG+GLM-5+Kimi cell (N=5, T=5).
Figure 10. Per-seed phase-1 outcome across the 8 cells; rows = seeds, columns = cells.
Figure 11. Per-(seed, ablation) ⋆ outcome.
Figure 12. Phase-1 AIG audit-verdict mix. Leading finding categories: CWE-94 (19, code injection), CWE-532 (16, insertion of sensitive information into a log), and CWE-78 (15, OS command injection).
Figure 13. Per-round outcome under AIG+GPT-5.4-mini.
Figure 14. AIG finding-category frequency on PROTEUS-submitted variants across the 4 AIG cells; SV is excluded (degenerate reason_code).
Original abstract

Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third-party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context-setting documentation, its deployment risk cannot be measured by single-shot audits or prompt-level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as adaptive leakage -- whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm -- and present Proteus, a grey-box self-evolving red-team framework for measuring it. Proteus searches a formalized five-axis skill-attack space. Each candidate is evaluated through a unified audit-sandbox-oracle pipeline that returns structured audit findings and runtime evidence to guide cross-round mutation. Beyond initial evasion, Proteus performs path expansion, which finds alternative implementations of successful attacks, and surface expansion, which transfers learned implementation patterns to new attack objectives beyond the original seed catalogue. Across eight phase-1 cells, Proteus reaches 40-90% Attack Success Rate at 5 rounds (ASR@5) with positive learning-curve slopes on both evaluated auditors. Phase-2 path/surface expansion produces 438 jointly bypassing and lethal variants, with SkillVetter bypassed at ≥93% in every cell and AI-Infra-Guard, the strongest public auditor we evaluate, still admitting up to 41.3% joint-success. These results show that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback-driven attackers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Proteus, a grey-box self-evolving red-team framework for measuring adaptive leakage in third-party agent skills for LLM agents. It formalizes a five-axis skill-attack space and employs an audit-sandbox-oracle pipeline with cross-round mutation, path expansion, and surface expansion to iteratively revise skills. Experiments across eight phase-1 cells report 40-90% ASR@5 with positive learning-curve slopes against two auditors; phase-2 expansion yields 438 jointly bypassing lethal variants, with SkillVetter bypassed at >=93% and AI-Infra-Guard admitting up to 41.3% joint success. The central claim is that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback-driven attackers.

Significance. If the five-axis space and mutation operators are shown to be representative of realistic budgeted attackers, the work would be significant for AI security and agent ecosystems. It supplies a concrete, quantitative method to expose gaps in static vetting, introduces the notion of adaptive leakage, and demonstrates that feedback loops plus path/surface expansion can generate large numbers of bypassing variants. The positive learning curves and reproducible metrics against public auditors (SkillVetter, AI-Infra-Guard) provide a useful benchmark for future auditor design and marketplace policies.

major comments (2)
  1. [Abstract] The claim that current vetting 'substantially underestimates residual risk' is load-bearing on the assumption that the formalized five-axis skill-attack space plus the grey-box mutation strategies (cross-round, path expansion, surface expansion) adequately capture realistic adaptive attacker capabilities. The abstract supplies no external grounding (mapping to disclosed incidents, comparison with human red-team strategies, or ablation of omitted dimensions such as multi-skill composition), so the reported 40-90% ASR@5 and 41.3% joint success may reflect an internally powerful search procedure rather than a faithful threat model.
  2. [Abstract] Experimental results: The phase-2 results (438 jointly bypassing variants, >=93% bypass for SkillVetter, 41.3% for AI-Infra-Guard) are presented without accompanying details on experimental controls, the precise definition of 'lethal' runtime harm, auditor implementation specifics, or how the attack budget in rounds was allocated. These omissions prevent assessment of whether the quantitative outcomes support the underestimation conclusion or are sensitive to unstated biases in the attack space.
minor comments (2)
  1. [Abstract] The term 'adaptive leakage' is introduced in the abstract but would benefit from an explicit formal definition or axiomatic statement early in the manuscript to aid readers unfamiliar with the framing.
  2. [Abstract] Notation for ASR@5 and joint-success metrics is used without an accompanying table or equation that defines the exact success criteria and aggregation method across cells.
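One plausible reading of the undefined metrics (an assumption for illustration, not the paper's stated definition): a variant counts as a joint success when it both passes audit and produces oracle-verified harm, and ASR@5 is the fraction of seeds achieving this within a five-round budget.

```python
def joint_success(audit_approved: bool, oracle_lethal: bool) -> bool:
    """Assumed criterion: auditor approves AND oracle verifies runtime harm."""
    return audit_approved and oracle_lethal

# Invented per-seed round outcomes, each entry (approved, lethal) per round,
# for three seeds evaluated under a 5-round budget:
seeds = [
    [(False, True), (True, True)],                   # jointly succeeds in round 2
    [(False, True), (False, True), (False, False)],  # never jointly succeeds
    [(True, False), (True, True)],                   # jointly succeeds in round 2
]

# ASR@5 under this reading: seeds with any joint success, over all seeds.
asr_at_5 = sum(
    any(joint_success(a, l) for a, l in rounds) for rounds in seeds
) / len(seeds)
```

Under these invented outcomes two of three seeds succeed, so ASR@5 would be 2/3; whether the paper aggregates exactly this way is what the minor comment asks the authors to state.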

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the grounding and experimental transparency of the manuscript while preserving its core contribution on adaptive leakage.

point-by-point responses
  1. Referee: The claim that current vetting 'substantially underestimates residual risk' is load-bearing on the assumption that the formalized five-axis skill-attack space plus the grey-box mutation strategies (cross-round, path expansion, surface expansion) adequately capture realistic adaptive attacker capabilities. The abstract supplies no external grounding (mapping to disclosed incidents, comparison with human red-team strategies, or ablation of omitted dimensions such as multi-skill composition), so the reported 40-90% ASR@5 and 41.3% joint success may reflect an internally powerful search procedure rather than a faithful threat model.

    Authors: We agree that stronger external grounding would better support the threat-model assumptions. The five-axis space and operators are derived from documented LLM-agent attack patterns in the literature (e.g., iterative prompt injection and tool misuse). We will revise the abstract and add a dedicated limitations paragraph in the introduction that (1) compares the operators to published human red-team tactics, (2) explicitly notes the omission of multi-skill composition, and (3) frames the results as evidence that static vetting can underestimate risk under adaptive feedback rather than a claim of exhaustive coverage. This revision clarifies the scope without altering the quantitative findings. revision: yes

  2. Referee: The phase-2 results (438 jointly bypassing variants, >=93% bypass for SkillVetter, 41.3% for AI-Infra-Guard) are presented without accompanying details on experimental controls, the precise definition of 'lethal' runtime harm, auditor implementation specifics, or how the attack budget in rounds was allocated. These omissions prevent assessment of whether the quantitative outcomes support the underestimation conclusion or are sensitive to unstated biases in the attack space.

    Authors: We acknowledge that the abstract alone omits these details. The full manuscript defines 'lethal' harm via the oracle as verified runtime violations (unauthorized data access or malicious code execution), describes the auditors as public implementations with appendix specifications, and fixes the budget at five rounds with cross-round mutation. To improve transparency, we will expand the experimental section with explicit controls, a sensitivity discussion on attack-space biases, and pseudocode for the audit-sandbox-oracle pipeline. These additions will allow readers to evaluate reproducibility and robustness directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from a defined framework, measured against external auditors.

full rationale

The paper explicitly defines the five-axis skill-attack space, grey-box mutation strategies, audit-sandbox-oracle pipeline, and path/surface expansion operators as inputs to Proteus. It then reports concrete experimental outcomes (40-90% ASR@5, 438 variants, >=93% bypass on SkillVetter) measured against independent public auditors. The conclusion that vetting underestimates residual risk is an interpretation of these measured success rates rather than a reduction by construction to the definitions themselves. No self-citations, fitted parameters, or uniqueness theorems are invoked as load-bearing steps in the provided text. The derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that the five-axis attack space and evolutionary mutations capture realistic adaptive threats, plus the representativeness of the two evaluated auditors.

free parameters (1)
  • attack budget in rounds
    ASR@5 is reported specifically at 5 rounds as the evaluation budget.
axioms (1)
  • domain assumption: The formalized five-axis skill-attack space covers the relevant dimensions of possible attacks on skills
    Used as the search space for candidate generation and mutation.
invented entities (1)
  • adaptive leakage (no independent evidence)
    purpose: To frame the risk of iterative skill revision that evades audit yet produces runtime harm
    New conceptual term introduced to describe the security problem being measured.

pith-pipeline@v0.9.0 · 5583 in / 1414 out tokens · 115203 ms · 2026-05-13T05:26:20.996431+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 7 internal anchors

  1. [1]

    AgentHarm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In International Conference on Learning Representations (ICLR), 2025

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  3. [3]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2025

  4. [4]

    AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  5. [5]

    AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  6. [6]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

  7. [7]

    Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023

  8. [8]

    RedCodeAgent: Automatic red-teaming agent against diverse code agents

    Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, and Bo Li. RedCodeAgent: Automatic red-teaming agent against diverse code agents. arXiv preprint arXiv:2510.02609, 2025

  9. [9]

    SkillProbe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration

    Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, and Weinan Zhang. SkillProbe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration. arXiv preprint arXiv:2603.21019, 2026

  10. [10]

    Red-teaming LLM multi-agent systems via communication attacks

    Pengfei He, Yupin Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming LLM multi-agent systems via communication attacks. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  11. [11]

    Malicious or not? adding repository context to agent skill classification

    Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, and Johanna Ullrich. Malicious or not? Adding repository context to agent skill classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

  12. [12]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024

  13. [13]

    Curiosity-driven red teaming for large language models

    Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red teaming for large language models. In International Conference on Learning Representations (ICLR), 2024

  14. [14]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023

  15. [15]

    LLM platform security: Applying a systematic evaluation framework to OpenAI’s ChatGPT plugins

    Umar Iqbal, Tadayoshi Kohno, and Franziska Roesner. LLM platform security: Applying a systematic evaluation framework to OpenAI’s ChatGPT plugins. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2024

  16. [16]

    EIA: Environmental injection attack on generalist web agents for privacy leakage

    Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. EIA: Environmental injection attack on generalist web agents for privacy leakage. In International Conference on Learning Representations (ICLR), 2025

  17. [17]

    Trojan’s whisper: Stealthy manipulation of OpenClaw through injected bootstrapped guidance

    Fazhong Liu, Zhuoyan Chen, Tu Lan, Haozhen Tan, Zhenyu Xu, Xiang Li, Guoxing Chen, Yan Meng, and Haojin Zhu. Trojan’s whisper: Stealthy manipulation of OpenClaw through injected bootstrapped guidance. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2026

  18. [18]

    AutoDAN: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In International Conference on Learning Representations (ICLR), 2024

  19. [19]

    Malicious agent skills in the wild: A large-scale security empirical study

    Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Ying Zhang, and Leo Yu Zhang. Malicious agent skills in the wild: A large-scale security empirical study. arXiv preprint arXiv:2602.06547, 2026

  20. [20]

    Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

    Lijia Lv, Xuehai Tang, Jie Wen, Jizhong Han, and Songlin Hu. Structured security auditing and robustness enhancement for untrusted agent skills. arXiv preprint arXiv:2604.25109, 2026

  21. [21]

    Tree of attacks: Jailbreaking black-box LLMs automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  22. [22]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023

  23. [23]

    PyRIT: Python risk identification tool for generative AI

    Microsoft AI Red Team. PyRIT: Python risk identification tool for generative AI. https://github.com/microsoft/PyRIT, 2024

  24. [24]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023

  25. [25]

    Neural exec: Learning (and learning from) execution triggers for prompt injection attacks

    Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. In Proceedings of the 2024 Workshop on Artificial Intelligence and Security (AISec), 2024

  26. [26]

    Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions

    Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (S&P), 2022

  27. [27]

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representat...

  28. [28]

    Rainbow teaming: Open-ended generation of diverse adversarial prompts

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, and Roberta Raileanu. Rainbow teaming: Open-ended generation of diverse adversarial prompts. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  29. [29]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  30. [30]

    Do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2024

  31. [31]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  32. [32]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research (TMLR), 2024

  33. [33]

    Jailbroken: How does LLM safety training fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS), 2023

  34. [34]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  35. [35]

    Benchmarking and defending against indirect prompt injection attacks on large language models

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025

  36. [36]

    AgenticRed: Evolving Agentic Systems for Red-Teaming

    Jiayi Yuan, Jonathan Nöther, Natasha Jaques, and Goran Radanović. AgenticRed: Evolving agentic systems for red-teaming. arXiv preprint arXiv:2601.13518, 2026

  37. [37]

    EvoAgent: Towards automatic multi-agent generation via evolutionary algorithms

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, and Deqing Yang. EvoAgent: Towards automatic multi-agent generation via evolutionary algorithms. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025

  38. [38]

    Active attacks: Red-teaming LLMs via adaptive environments

    Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, and Minsu Kim. Active attacks: Red-teaming LLMs via adaptive environments. arXiv preprint arXiv:2509.21947, 2025

  39. [39]

    InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics (ACL), 2024

  40. [40]

    Genesis: Evolving attack strategies for LLM web agent red-teaming

    Zheng Zhang, Jiarui He, Yuchen Cai, Deheng Ye, Peilin Zhao, Ruili Feng, and Hao Wang. Genesis: Evolving attack strategies for LLM web agent red-teaming. arXiv preprint arXiv:2510.18314, 2025

  41. [41]

    AutoRedTeamer: Autonomous red teaming with lifelong attack integration

    Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, and Bo Li. AutoRedTeamer: Autonomous red teaming with lifelong attack integration. arXiv preprint arXiv:2503.15754, 2025

  42. [42]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  43. [43]

    PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models. In USENIX Security Symposium, 2025

A Additional Results

A.1 Red-Team Baselines and Run Configuration

Baseline mutators. (i) Random: each round picks a uniform random combination of (c, κ, d) edits from t...

Target executed something but missed the canary: the synthesiser enumerates which oracle sensors stayed dark and which harness paths would have triggered them (e.g. cat $HOME/.env fired the file-read sensor on the wrong path, because the harness puts .env at $HOME/.openclaw/workspace/.env). The mutator can then patch its script.
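The sensor-gap analysis described above can be sketched as a simple comparison between the paths an attack actually touched and the paths the harness instruments. This is a minimal illustrative sketch; the names (`SensorHint`, `dark_sensor_hints`) are hypothetical, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class SensorHint:
    sensor: str        # which oracle sensor stayed dark
    touched_path: str  # path the attack actually accessed
    armed_path: str    # path the harness instruments with a canary

def dark_sensor_hints(touched_paths, armed_sensors):
    """armed_sensors maps a sensor name to the canary path it watches.

    For each sensor that did not fire, look for a near-miss: the attack
    touched a file with the same basename but in the wrong directory.
    """
    hints = []
    for sensor, armed in armed_sensors.items():
        if armed in touched_paths:
            continue  # the sensor's canary path was hit; nothing to report
        for p in touched_paths:
            if p.split("/")[-1] == armed.split("/")[-1]:
                hints.append(SensorHint(sensor, p, armed))
    return hints

# The example from the text: the script read $HOME/.env, but the harness
# arms the canary at $HOME/.openclaw/workspace/.env.
hints = dark_sensor_hints(
    touched_paths={"$HOME/.env"},
    armed_sensors={"file_read": "$HOME/.openclaw/workspace/.env"},
)
```

The resulting hint tells the mutator exactly which path to patch in its script.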

Zero actuation despite bypass: commands_executed=0, no file changes, attack_succeeded=False. The synthesiser surfaces the two candidate failure modes side by side, (a) the runtime allowlist denied the primitive, or (b) target alignment refused at the LLM layer, together with the target's literal reply, leaving disambiguation to the mutator. Because audit findi...

Sample efficiency. Structured F_t (category, confidence, evidence pointer) as in-context side-information is fed directly into Reason_θ, letting the LLM complete "diagnosis + strategy selection" in one forward pass without thousands of roll-outs to estimate a gradient, the ReEvo-style verbal-gradient advantage.
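A structured finding of this shape can be passed to the reasoning step as plain in-context text. The sketch below is hypothetical: the field names and prompt wording are illustrative, not the paper's actual F_t schema or Reason_θ prompt.

```python
import json

# Illustrative structured finding F_t: category, confidence, evidence pointer.
finding = {
    "category": "zero_actuation",       # observed failure mode
    "confidence": 0.8,                  # auditor/oracle confidence
    "evidence": "transcript.txt:L117",  # pointer into the run artifacts
}

def reason_prompt(finding, prev_skill_summary):
    """Build a single-pass 'diagnosis + strategy selection' prompt."""
    return (
        "You are mutating an agent skill.\n"
        f"Previous variant: {prev_skill_summary}\n"
        f"Structured finding: {json.dumps(finding)}\n"
        "Diagnose the failure and select one mutation strategy."
    )

prompt = reason_prompt(finding, "reads workspace files, exfiltrates via HTTP")
```

Because the finding arrives as side-information rather than a scalar reward, a single LLM call can both diagnose the failure and pick the next edit.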

Interpretability. The natural-language reasoning trace h_t produced in Reason directly records why a mutation was made, providing an analyzable object for RQ3 (feedback exploitation) and ablation A9 (scalar b_t only). RL policy-network weight deltas do not have this readability.

No differentiable oracle required. The skill-level attack oracle is a hybrid of rules and LLM judges: non-differentiable and high-latency. Direct RL would stall on reward noise.

Alignment with mainstream agent-based red teaming. PAIR, TAP, Rainbow Teaming, and EvoAgent all use iterative refinement rather than end-to-end RL. Our contribution is lifting side-information from the prompt layer to the skill dual-channel (code + doc) plus the structured finding. Formally, the two paradigms differ as:

    RL:   θ ← θ + α ∇_θ E[r_t]
    Ours: s_t = Mutate_θ(s_{t−1}, Reason...
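The "Ours" update above is an iterative refinement loop rather than a gradient step: evaluate the current skill, reason over the structured finding, mutate, repeat. A minimal sketch, with toy stand-ins for the paper's evaluate/reason/mutate components (all names here are hypothetical placeholders):

```python
def red_team_loop(s0, evaluate, reason, mutate, rounds=5):
    """Iterative refinement: s_t = mutate(s_{t-1}, reason(s_{t-1}, F_t))."""
    s, history = s0, []
    for _ in range(rounds):
        finding = evaluate(s)            # audit-sandbox-oracle feedback F_t
        if finding["attack_succeeded"]:
            return s, history
        h = reason(s, finding)           # natural-language trace h_t
        history.append(h)
        s = mutate(s, h)                 # next skill variant
    return s, history

# Toy stand-ins: the attack "succeeds" once the script targets the
# canary path the harness actually arms.
evaluate = lambda s: {"attack_succeeded": "workspace/.env" in s}
reason = lambda s, f: "patch path"
mutate = lambda s, h: s.replace("$HOME/.env", "$HOME/.openclaw/workspace/.env")

final, trace = red_team_loop("cat $HOME/.env", evaluate, reason, mutate)
```

No parameters are updated anywhere in the loop; all adaptation lives in the skill state s_t and the readable trace history, which is exactly the interpretability contrast drawn with RL weight deltas.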