arxiv: 2604.06811 · v1 · submitted 2026-04-08 · 💻 cs.CR · cs.AI

Recognition: unknown

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng , Yifan Ding , Yingshui Tan , Boren Zheng , Yanming Guo , Xiaolong Li , Kun Zhai , Yishan Li

show 1 more author

Wenke Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords backdoor attacksskill-based agentsAI agent securitypayload partitioningskill compositionbackdoor embeddingagent vulnerabilities

0 comments

The pith

SkillTrojan shows that backdoors can be hidden inside reusable skills for AI agents, reaching high attack success while keeping normal behavior largely intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that skill-based agent systems, which compose reusable skills to solve complex tasks, introduce a new attack surface where malicious logic can be embedded directly in those skills. SkillTrojan partitions an encrypted payload across multiple benign-looking skills that only reconstruct and execute the full attack when a trigger condition appears. It also provides automated ways to generate such backdoored skills from existing templates. Experiments in a code-based agent setup demonstrate attack success rates up to 97.2 percent on tasks like EHR SQL queries while clean accuracy stays at 89.3 percent. This approach works because it relies on the agent's standard skill composition rather than changing model weights or training data.

Core claim

SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems.

What carries the argument

Payload partitioning across skills, which uses the agent's own composition mechanism to reassemble and run the hidden attack from separate benign invocations.

If this is right

Skill-level backdoors achieve up to 97.2 percent attack success rate with only modest drops in clean-task accuracy.
Automated synthesis from templates allows backdoored skills to spread across agent ecosystems without manual effort.
Current agent architectures have a blind spot because they do not monitor how skills combine and execute.
Defenses will need to inspect skill composition and execution flows rather than individual skills alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents that draw from large shared skill libraries may face higher risk because a single compromised skill can be reused across many tasks.
Existing safety checks focused on model outputs or single skills are unlikely to catch attacks that activate only after composition.
The released dataset of over 3,000 backdoored skills could serve as a benchmark for testing new detection methods that track cross-skill interactions.

Load-bearing premise

That standard skill composition mechanisms will reliably reconstruct and execute the partitioned payload from multiple benign-looking skill invocations without detection or interference from the agent's safety layers.

What would settle it

A test in which the agent's safety mechanisms detect or block execution when the trigger is issued and the payload remains split across separate skill calls.

Figures

Figures reproduced from arXiv: 2604.06811 by Boren Zheng, Kun Zhai, Wenke Huang, Xiaolong Li, Yanming Guo, Yifan Ding, Yingshui Tan, Yishan Li, Yunhao Feng.

**Figure 1.** Figure 1: Skill-based agent execution model. Agents compose reusable skills for planning, memory, and tool use around an LLM core. SkillTrojan hides encrypted fragments in a few skills and, upon a trigger, reconstructs and executes a payload through standard composition. 1. Introduction Skill-based abstractions are now a prevalent design pattern in agent systems (Li et al., 2025b; Wang et al., 2025; Zheng et al., 2… view at source ↗

**Figure 2.** Figure 2: SkillTrojan: A layered backdoor attack on skill-based coding agents. The figure illustrates a multi-layer execution pipeline spanning the user query layer, LLM-based agent reasoning, reusable third-party skill execution, and side-effect execution. An attacker embeds encrypted payload fragments across multiple benign-looking skill invocations. Under a triggered query, fragments are emitted during normal ski… view at source ↗

**Figure 3.** Figure 3: reports the distribution of tool-call counts under clean and triggered queries. Triggered runs show a small but systematic increase in tool usage corresponding to fragment emission and verification, while remaining within the Clean Triggered 5 10 15 20 25 30 Tool calls Distribution summary 5 10 15 20 25 30 Tool calls per run 0.00 0.05 0.10 Probability Discrete frequency Clean Triggered Tool-call distribut… view at source ↗

**Figure 5.** Figure 5: Ablation on fragment count N under GPT-4o-mini-0718- Global. ASR is maximized at moderate fragmentation: small N increases per-tool anomaly and lowers compliance with the intended workflow, while large N increases the probability of incomplete fragment collection before verification. ACC remains stable [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 4.** Figure 4: Coverage of SkillTrojanX. We report the distribution of backdoored variants across template categories and payload families. Fragment count N. Fragmentation controls a three-way trade-off between per-tool suspiciousness, workflow completion, and reconstruction reliability. When N is too small, each tool output must carry a larger fragment. In practice this increases the salience of anomalous encoded conte… view at source ↗

**Figure 6.** Figure 6: Ablation on poisoning ratio ρ under GPT-4o-mini-0718-Global. ACC is computed on clean queries and ASR on poisoned queries. Both are stable across ρ, consistent with conditional dormancy and per-run activation [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

read the original abstract

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SkillTrojan moves backdoors to the skill-composition layer in agents and supplies a useful dataset, but the attack's reported success rests on untested assumptions about how planners actually reassemble and run partitioned payloads.

read the letter

The main point is that this work attacks the skills and their composition rules in agent systems instead of the model weights or training data. It splits an encrypted payload across several normal-looking skills, then counts on the agent's standard composition logic to reconstruct and execute the payload only when a trigger appears. They add an automated way to generate these backdoored skills from templates and release a dataset of more than 3000 examples across different patterns. That combination is new relative to earlier backdoor papers on agents or LLMs. The concrete numbers on the EHR SQL task (up to 97% attack success with 89% clean accuracy on the tested GPT variant) give a clear starting point, and releasing the dataset is a practical contribution that others can use to test defenses or extensions. The paper flags a real gap in how modular agent designs are currently secured. The soft spots are in the evaluation details and the central assumption. The abstract gives attack and clean accuracy figures but does not spell out the full protocol, baselines, or checks for statistical significance. More critically, the attack requires that the agent's planner will call the backdoored skills in an order that allows payload re-assembly, that the reconstructed payload runs without safety-layer interference, and that benign behavior stays intact. The description does not show tests against planner nondeterminism, context truncation, or different composition implementations. If any of those steps fail in practice, the attack does not transfer. This is for people working on secure agent architectures and backdoor defenses. It is worth a serious referee because the attack surface is genuine and the dataset lowers the barrier for follow-up, even though the current evidence needs more experiments on composition robustness before the claims can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The paper proposes SkillTrojan, a backdoor attack on skill-based agent systems that embeds malicious logic into otherwise benign skills. The attack partitions an encrypted payload across multiple skills so that standard composition mechanisms reconstruct and execute the payload only when a predefined trigger is present. The authors release a dataset of 3,000+ backdoored skills and evaluate the approach in a code-based agent setting on an EHR SQL task, reporting up to 97.2% attack success rate while preserving 89.3% clean accuracy on GPT-5.2-1211-Global.

Significance. If the empirical results prove robust, the work identifies a new attack surface arising from skill composition in modular agent architectures, which could inform defenses that explicitly model skill invocation order and payload reconstruction. The public dataset is a constructive contribution that enables follow-on research. The significance is currently limited by the absence of detailed evaluation protocols and explicit validation of the composition assumptions.

major comments (2)

[Abstract and Evaluation] Abstract and Evaluation section: the central effectiveness claim (97.2% ASR, 89.3% clean ACC) is presented without any description of the skill-composition mechanism, the payload-partitioning scheme, trigger detection logic, or robustness checks against planner nondeterminism, context truncation, or safety filters. These details are load-bearing for the claim that the attack succeeds via standard composition.
[Evaluation] Evaluation section: no information is supplied on the experimental protocol, including how clean accuracy was measured across triggers, choice of baselines, number of trials, or statistical significance. Without these, the reported metrics cannot be independently assessed for reliability.

minor comments (1)

[Abstract] Clarify the exact model identifier 'GPT-5.2-1211-Global' and whether it corresponds to a publicly available checkpoint or a hypothetical variant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater clarity on mechanisms and protocols. We will revise the manuscript to address these points directly.

read point-by-point responses

Referee: [Abstract and Evaluation] Abstract and Evaluation section: the central effectiveness claim (97.2% ASR, 89.3% clean ACC) is presented without any description of the skill-composition mechanism, the payload-partitioning scheme, trigger detection logic, or robustness checks against planner nondeterminism, context truncation, or safety filters. These details are load-bearing for the claim that the attack succeeds via standard composition.

Authors: Section 3 of the manuscript details the skill-composition mechanism (standard planner sequencing of skill calls), the payload-partitioning scheme (encrypted fragments distributed across skills), and trigger detection logic (activation on predefined input patterns). We agree the abstract and evaluation would benefit from explicit summaries of these and added robustness discussion. We will revise the abstract to include a concise overview of the mechanisms and add a robustness subsection in Evaluation covering planner nondeterminism, context truncation, and safety filter interactions. revision: yes
Referee: [Evaluation] Evaluation section: no information is supplied on the experimental protocol, including how clean accuracy was measured across triggers, choice of baselines, number of trials, or statistical significance. Without these, the reported metrics cannot be independently assessed for reliability.

Authors: We agree a more explicit protocol description is warranted for reproducibility. We will add a dedicated Experimental Protocol subsection to the Evaluation section that specifies measurement of clean accuracy on benign tasks, the baselines employed, the number of trials run, and how statistical significance is assessed. This will be integrated into the main text. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack construction with direct evaluation

full rationale

The paper describes an empirical backdoor attack (SkillTrojan) that partitions payloads across skills and evaluates success via direct experiments on agent tasks such as EHR SQL. No derivation chain, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. Results (e.g., 97.2% ASR, 89.3% clean ACC) are reported from explicit testing rather than any self-referential or fitted logic, making the work self-contained as a security demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security paper proposing a concrete attack; it introduces no mathematical free parameters, no unproved axioms, and no new postulated entities. The contribution rests on the design of the attack and the empirical measurements reported.

pith-pipeline@v0.9.0 · 5551 in / 1217 out tokens · 81957 ms · 2026-05-10T17:30:41.725098+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry
cs.AI 2026-05 unverdicted novelty 8.0

Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
cs.CR 2026-05 unverdicted novelty 6.0

Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...

Reference graph

Works this paper leans on

21 extracted references · 20 canonical work pages · cited by 3 Pith papers · 6 internal anchors

[1]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models,

Cheng, P., Ding, Y ., Ju, T., Wu, Z., Du, W., Yi, P., Zhang, Z., and Liu, G. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401,

work page arXiv
[3]

Hidden ghost hand: Unveiling backdoor vulnerabilities in mllm-powered mobile gui agents

Cheng, P., Hu, H., Wu, Z., Wu, Z., Ju, T., Zhang, Z., and Liu, G. Hidden ghost hand: Unveiling backdoor vulnerabili- ties in mllm-powered mobile gui agents.arXiv preprint arXiv:2505.14418,

work page arXiv
[4]

A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily

Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y ., Chen, J., and Huang, S. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pp. 2136–2153,

2024
[5]

BackdoorAgent: A unified framework for backdoor attacks on LLM-based agents.arXiv preprintarXiv:2601.04566, 2026

Feng, Y ., Li, Y ., Wu, Y ., Tan, Y ., Guo, Y ., Ding, Y ., Zhai, K., Ma, X., and Jiang, Y . Backdooragent: A unified frame- work for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566,

work page arXiv
[6]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793,

work page internal anchor Pith review arXiv
[7]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

arXiv preprint arXiv:2409.17458 , year=

9 Preprint Version Jiang, Y ., Aggarwal, K., Laud, T., Munir, K., Pujara, J., and Mukherjee, S. Red queen: Safeguarding large language models against concealed multi-turn jailbreaking.arXiv preprint arXiv:2409.17458,

work page arXiv
[9]

Backdoorvlm: A benchmark for backdoor attacks on vision-language models

Li, J., Li, Y ., Huang, H., Chen, Y ., Wang, X., Wang, Y ., Ma, X., and Jiang, Y .-G. Backdoorvlm: A benchmark for backdoor attacks on vision-language models.arXiv preprint arXiv:2511.18921, 2025a. Li, T., Bai, C., Xu, K., Chu, C., Zhu, P., and Wang, Z. Skill matters: Dynamic skill learning for multi-agent cooperative reinforcement learning.Neural Network...

work page arXiv
[10]

Backdoor- llm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv preprint arXiv:2408.12798,

Li, Y ., Huang, H., Zhao, Y ., Ma, X., and Sun, J. Backdoor- llm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv preprint arXiv:2408.12798,

work page arXiv
[11]

M., Huang, H., Ma, X., and Sun, J

Li, Y ., Li, Z., Zhao, W., Min, N. M., Huang, H., Ma, X., and Sun, J. Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025c. Liu, A. Z., Choi, J., Sohn, S., Fu, Y ., Kim, J., Kim, D.-K., Wang, X., Yoo, J., and Lee, H. Skillact: Using skill ab- stractions improves llm agents. InICML 2024 Workshop on LLMs and Cognition,

work page arXiv 2024
[12]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generat- ing stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451,

work page internal anchor Pith review arXiv
[13]

arXiv preprint arXiv:2501.04575 , year=

Liu, Y ., Li, P., Wei, Z., Xie, C., Hu, X., Xu, X., Zhang, S., Han, X., Yang, H., and Wu, F. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. arXiv preprint arXiv:2501.04575,

work page arXiv
[14]

arXiv preprint arXiv:2504.06821 , year=

Wang, Z. Z., Gandhi, A., Neubig, G., and Fried, D. Inducing programmatic skills for agentic tasks.arXiv preprint arXiv:2504.06821,

work page arXiv
[15]

Badchain: Backdoor chain-of-thought prompting for large language models.arXiv preprint arXiv:2401.12242,

Xiang, Z., Jiang, F., Xiong, Z., Ramasubramanian, B., Poovendran, R., and Li, B. Badchain: Backdoor chain- of-thought prompting for large language models.arXiv preprint arXiv:2401.12242,

work page arXiv
[16]

Advagent: Controllable blackbox red- teaming on web agents.arXiv preprint arXiv:2410.17401,

Xu, C., Kang, M., Zhang, J., Liao, Z., Mo, L., Yuan, M., Sun, H., and Li, B. Advagent: Controllable blackbox red- teaming on web agents.arXiv preprint arXiv:2410.17401,

work page arXiv
[17]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Zhang, H., Huang, J., Mei, K., Yao, Y ., Wang, Z., Zhan, C., Wang, H., and Zhang, Y . Agent security bench (asb): For- malizing and benchmarking attacks and defenses in llm- based agents.arXiv preprint arXiv:2410.02644,

work page internal anchor Pith review arXiv
[18]

Usb: A comprehensive and unified safety evaluation benchmark for multimodal large language models

Zheng, B., Chen, G., Zhong, H., Teng, Q., Tan, Y ., Liu, Z., Wang, W., Liu, J., Yang, J., Jing, H., et al. Usb: A comprehensive and unified safety evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2505.23793, 2025a. Zheng, B., Fatemi, M. Y ., Jin, X., Wang, Z. Z., Gandhi, A., Song, Y ., Gu, Y ., Srinivasa, J., Liu, G., Neubig,...

work page arXiv
[19]

Demonagent: Dynamically encrypted multi-backdoor implantation attack on llm-based agent.arXiv preprint arXiv:2502.12575,

Zhu, P., Zhou, Z., Zhang, Y ., Yan, S., Wang, K., and Su, S. Demonagent: Dynamically encrypted multi-backdoor implantation attack on llm-based agent.arXiv preprint arXiv:2502.12575,

work page arXiv
[20]

AutoDAN: Interpretable Gradient- Based Adversarial Attacks on Large Language Mod- els

Zhu, S., Zhang, R., An, B., Wu, G., Barrow, J., Wang, Z., Huang, F., Nenkova, A., and Sun, T. Autodan: inter- pretable gradient-based adversarial attacks on large lan- guage models.arXiv preprint arXiv:2310.15140,

work page arXiv
[21]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversar- ial attacks on aligned language models.arXiv preprint arXiv:2307.15043,

work page internal anchor Pith review Pith/arXiv arXiv