SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
Pith reviewed 2026-06-28 14:58 UTC · model grok-4.3
The pith
Agents stay vulnerable to poisoned skills that deliver fixed harm or mutate across repeated uses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Third-party skills constitute a persistent attack surface because agents are expected to invoke and follow them; SkillHarm shows that both immediate fixed-payload poisoning and deferred self-mutating poisoning can be constructed at scale, that agents execute the resulting harms at high rates, and that observed failures often trace to simple non-engagement with the poisoned skill rather than active resistance.
What carries the argument
The SkillHarm benchmark, which pairs Fixed-Payload Poisoning and Self-Mutating Poisoning attack scenarios with a taxonomy of twelve risk types that target data pipelines, system environments, or agent autonomy.
If this is right
- A single poisoned skill package can compromise any task session that calls it.
- An initially safe skill can silently rewrite itself to cause harm only on later reuse.
- Many apparent attack failures occur because the agent simply ignores the poisoned file rather than detecting the threat.
- Existing defenses do not reliably block either form of skill poisoning.
- Automated construction pipelines can produce hundreds of attack samples without manual enumeration of risks.
Where Pith is reading between the lines
- Verification of skill file integrity before first use could reduce exposure in both attack scenarios.
- The same lifecycle model might apply to other trusted agent components such as tools or shared memory stores.
- Tests on agents with different architectures would show whether susceptibility varies by design choices.
- The twelve-category risk taxonomy could guide systematic auditing of production agent deployments.
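The integrity-verification idea in the first bullet can be sketched in a few lines. This is a minimal illustration, not something the paper proposes or evaluates: the function names `skill_manifest` and `changed_files` are hypothetical, and the sketch assumes a skill is a directory of files whose digests were pinned at review time.

```python
import hashlib
from pathlib import Path


def skill_manifest(skill_dir: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    return {
        path.relative_to(skill_dir).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(skill_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(skill_dir: Path, pinned: dict[str, str]) -> list[str]:
    """Return files added, removed, or modified relative to the pinned manifest."""
    current = skill_manifest(skill_dir)
    return sorted(
        name
        for name in current.keys() | pinned.keys()
        if current.get(name) != pinned.get(name)
    )
```

A caller would pin the manifest once, then refuse to load the skill whenever `changed_files` returns a non-empty list. Such a check would flag content rewritten between sessions, as in the self-mutating scenario. It would not help against a package that was already poisoned when it was pinned, unless the pinned manifest comes from a trusted source.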
Load-bearing premise
The evaluation assumes agents will invoke and run the supplied poisoned skills during normal task sessions without built-in checks that would block or refuse them.
What would settle it
An agent that consistently refuses to load or execute any poisoned skill file, across both fixed-payload and self-mutating cases and across repeated task sessions, would show the claimed vulnerability does not hold.
Original abstract
Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SkillHarm, a benchmark and taxonomy for skill-based attacks on LLM agents across the full skill-use lifecycle. It defines two attack scenarios: Fixed-Payload Poisoning (FPP), which compromises a session directly upon invocation, and Self-Mutating Poisoning (SMP), which defers harm through persistent mutation of skill content. It also defines a 12-type risk taxonomy targeting data pipelines, system environments, and agent autonomy. An automated pipeline (AutoSkillHarm) using coding agents generates 879 attack samples over 71 skills. Experiments report attack success rates (ASR) of up to 86.3% (FPP) and 69.3% (SMP), while noting that many non-successes arise from non-engagement rather than resistance.
Significance. If the empirical claims hold after clarification of metrics, the work provides a systematic, lifecycle-aware evaluation framework and scalable attack generator that could meaningfully advance understanding of third-party skill risks in agent systems. The automated construction pipeline and explicit distinction between engagement failures and resistance are constructive contributions that go beyond ad-hoc lists in prior work.
major comments (3)
- [Experiments section and abstract] The headline ASR figures (86.3% FPP, 69.3% SMP) are presented without an explicit, reproducible definition of attack success, including whether success is conditioned on the agent invoking the poisoned skill, how non-engagement cases are scored, and what baselines or control conditions (e.g., clean skills, refusal-enabled agents) were used. This definition is load-bearing for the central vulnerability claim.
- [Benchmark construction, §3 or equivalent] The manuscript states that 879 samples were produced and assigned to 12 risk types but provides insufficient detail on validation procedures, inter-annotator agreement, or automated checks that the generated skills actually realize the intended harm category and do not contain extraneous errors. Without this, the scale and taxonomy claims rest on unverified construction output.
- [Evaluation methodology] The reported success rates presuppose that agents will treat supplied third-party skills as normal workflow components and execute them; the paper acknowledges non-engagement as a common failure mode but does not quantify how often this occurs across models or test variants with explicit trust/refusal logic that would prevent invocation. This assumption directly determines whether the measured ASRs demonstrate practical agent weakness.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a short table summarizing the 12 risk types with one-sentence definitions to improve readability before the detailed taxonomy section.
- Notation for FPP vs. SMP attack phases is introduced but used inconsistently in some figure captions; a single glossary or consistent abbreviation list would help.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, providing clarifications from the manuscript and committing to revisions where they will improve rigor and reproducibility.
Point-by-point responses
- Referee: [Experiments section and abstract] The headline ASR figures (86.3% FPP, 69.3% SMP) are presented without an explicit, reproducible definition of attack success, including whether success is conditioned on the agent invoking the poisoned skill, how non-engagement cases are scored, and what baselines or control conditions (e.g., clean skills, refusal-enabled agents) were used. This definition is load-bearing for the central vulnerability claim.
  Authors: We agree that an explicit definition strengthens the central claims. The manuscript already notes that many failures arise from non-engagement rather than resistance, but we will add a dedicated paragraph in the Experiments section (and update the abstract) defining ASR as the rate at which the agent invokes the poisoned skill and executes the harmful payload. Non-engagement will be reported as a distinct metric. We will also include baseline results with clean skills and discuss refusal-enabled variants. These additions will be made in the revision.
  Revision: yes
- Referee: [Benchmark construction, §3 or equivalent] The manuscript states that 879 samples were produced and assigned to 12 risk types but provides insufficient detail on validation procedures, inter-annotator agreement, or automated checks that the generated skills actually realize the intended harm category and do not contain extraneous errors. Without this, the scale and taxonomy claims rest on unverified construction output.
  Authors: We will expand the benchmark construction section to detail the automated validation procedures in AutoSkillHarm. This includes syntax and execution checks performed by the coding agents, filtering for alignment with the target risk type, and post-generation quality metrics such as the proportion of valid samples. As generation is fully automated, inter-annotator agreement does not apply; we will instead report the pipeline's yield and error rates. These details will be added without requiring new experiments.
  Revision: yes
- Referee: [Evaluation methodology] The reported success rates presuppose that agents will treat supplied third-party skills as normal workflow components and execute them; the paper acknowledges non-engagement as a common failure mode but does not quantify how often this occurs across models or test variants with explicit trust/refusal logic that would prevent invocation. This assumption directly determines whether the measured ASRs demonstrate practical agent weakness.
  Authors: The manuscript explicitly distinguishes non-engagement from resistance. We will add quantitative engagement-rate breakdowns per model in the revised Experiments section. Our evaluation targets standard agent configurations; testing explicit trust/refusal logic would require framework modifications beyond the current benchmark scope and will be noted as a limitation with suggestions for future work. The high ASRs conditional on engagement still demonstrate the practical risk under the lifecycle-aware threat model.
  Revision: partial
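The distinction the referee and authors are negotiating, between overall ASR, engagement rate, and ASR conditional on engagement, can be made concrete with a small sketch. It assumes each trial records two booleans; the `Trial` fields and the `summarize` function are hypothetical names, and the counts in the usage example are illustrative, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    engaged: bool  # the agent opened or invoked the poisoned skill
    harmed: bool   # the harmful payload was actually executed


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Report overall ASR alongside engagement-aware breakdowns."""
    total = len(trials)
    engaged = sum(t.engaged for t in trials)
    # A success requires both engagement and execution of the payload.
    successes = sum(t.engaged and t.harmed for t in trials)
    return {
        "asr": successes / total if total else 0.0,
        "engagement_rate": engaged / total if total else 0.0,
        "asr_given_engagement": successes / engaged if engaged else 0.0,
        "non_engagement_rate": (total - engaged) / total if total else 0.0,
    }


# Illustrative counts only: 6 successes, 2 engaged-but-resisted, 2 never engaged.
trials = (
    [Trial(True, True)] * 6
    + [Trial(True, False)] * 2
    + [Trial(False, False)] * 2
)
print(summarize(trials))
```

With these made-up counts the overall ASR is 0.6 while the ASR conditional on engagement is 0.75, which shows why reporting only the headline figure can understate risk when non-engagement is common.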
Circularity Check
No circularity: empirical benchmark construction and measurement
Full rationale
The paper introduces SkillHarm as an empirical benchmark of constructed attacks (FPP and SMP scenarios) evaluated via direct experimentation on agents, reporting measured attack success rates from 879 samples. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central results are outcomes of an automated construction pipeline (AutoSkillHarm) and testing, not reductions to inputs by definition. This is self-contained empirical work with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agents are expected to implicitly follow and execute third-party skills, rendering them a vulnerable attack surface.
invented entities (3)
- Fixed-Payload Poisoning (FPP): no independent evidence
- Self-Mutating Poisoning (SMP): no independent evidence
- 12 risk types taxonomy: no independent evidence