Recognition: unknown
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
SkillTrojan shows that backdoors can be hidden inside reusable skills for AI agents, reaching high attack success while keeping normal behavior largely intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems.
What carries the argument
Payload partitioning across skills, which uses the agent's own composition mechanism to reassemble and run the hidden attack from separate benign invocations.
If this is right
- Skill-level backdoors achieve up to 97.2 percent attack success rate with only modest drops in clean-task accuracy.
- Automated synthesis from templates allows backdoored skills to spread across agent ecosystems without manual effort.
- Current agent architectures have a blind spot because they do not monitor how skills combine and execute.
- Defenses will need to inspect skill composition and execution flows rather than individual skills alone.
Where Pith is reading between the lines
- Agents that draw from large shared skill libraries may face higher risk because a single compromised skill can be reused across many tasks.
- Existing safety checks focused on model outputs or single skills are unlikely to catch attacks that activate only after composition.
- The released dataset of over 3,000 backdoored skills could serve as a benchmark for testing new detection methods that track cross-skill interactions.
Load-bearing premise
That standard skill composition mechanisms will reliably reconstruct and execute the partitioned payload from multiple benign-looking skill invocations without detection or interference from the agent's safety layers.
What would settle it
A test in which the agent's safety mechanisms detect or block execution when the trigger is issued and the payload remains split across separate skill calls.
Figures
read the original abstract
Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SkillTrojan, a backdoor attack on skill-based agent systems that embeds malicious logic into otherwise benign skills. The attack partitions an encrypted payload across multiple skills so that standard composition mechanisms reconstruct and execute the payload only when a predefined trigger is present. The authors release a dataset of 3,000+ backdoored skills and evaluate the approach in a code-based agent setting on an EHR SQL task, reporting up to 97.2% attack success rate while preserving 89.3% clean accuracy on GPT-5.2-1211-Global.
Significance. If the empirical results prove robust, the work identifies a new attack surface arising from skill composition in modular agent architectures, which could inform defenses that explicitly model skill invocation order and payload reconstruction. The public dataset is a constructive contribution that enables follow-on research. The significance is currently limited by the absence of detailed evaluation protocols and explicit validation of the composition assumptions.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: the central effectiveness claim (97.2% ASR, 89.3% clean ACC) is presented without any description of the skill-composition mechanism, the payload-partitioning scheme, trigger detection logic, or robustness checks against planner nondeterminism, context truncation, or safety filters. These details are load-bearing for the claim that the attack succeeds via standard composition.
- [Evaluation] Evaluation section: no information is supplied on the experimental protocol, including how clean accuracy was measured across triggers, choice of baselines, number of trials, or statistical significance. Without these, the reported metrics cannot be independently assessed for reliability.
minor comments (1)
- [Abstract] Clarify the exact model identifier 'GPT-5.2-1211-Global' and whether it corresponds to a publicly available checkpoint or a hypothetical variant.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater clarity on mechanisms and protocols. We will revise the manuscript to address these points directly.
read point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: the central effectiveness claim (97.2% ASR, 89.3% clean ACC) is presented without any description of the skill-composition mechanism, the payload-partitioning scheme, trigger detection logic, or robustness checks against planner nondeterminism, context truncation, or safety filters. These details are load-bearing for the claim that the attack succeeds via standard composition.
Authors: Section 3 of the manuscript details the skill-composition mechanism (standard planner sequencing of skill calls), the payload-partitioning scheme (encrypted fragments distributed across skills), and trigger detection logic (activation on predefined input patterns). We agree the abstract and evaluation would benefit from explicit summaries of these and added robustness discussion. We will revise the abstract to include a concise overview of the mechanisms and add a robustness subsection in Evaluation covering planner nondeterminism, context truncation, and safety filter interactions. revision: yes
-
Referee: [Evaluation] Evaluation section: no information is supplied on the experimental protocol, including how clean accuracy was measured across triggers, choice of baselines, number of trials, or statistical significance. Without these, the reported metrics cannot be independently assessed for reliability.
Authors: We agree a more explicit protocol description is warranted for reproducibility. We will add a dedicated Experimental Protocol subsection to the Evaluation section that specifies measurement of clean accuracy on benign tasks, the baselines employed, the number of trials run, and how statistical significance is assessed. This will be integrated into the main text. revision: yes
Circularity Check
No circularity: empirical attack construction with direct evaluation
full rationale
The paper describes an empirical backdoor attack (SkillTrojan) that partitions payloads across skills and evaluates success via direct experiments on agent tasks such as EHR SQL. No derivation chain, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. Results (e.g., 97.2% ASR, 89.3% clean ACC) are reported from explicit testing rather than any self-referential or fitted logic, making the work self-contained as a security demonstration.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 3 Pith papers
-
Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry
Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...
Reference graph
Works this paper leans on
-
[1]
Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report.arXiv preprint arXiv:2309.16609,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models,
Cheng, P., Ding, Y ., Ju, T., Wu, Z., Du, W., Yi, P., Zhang, Z., and Liu, G. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401,
-
[3]
Hidden ghost hand: Unveiling backdoor vulnerabilities in mllm-powered mobile gui agents
Cheng, P., Hu, H., Wu, Z., Wu, Z., Ju, T., Zhang, Z., and Liu, G. Hidden ghost hand: Unveiling backdoor vulnerabili- ties in mllm-powered mobile gui agents.arXiv preprint arXiv:2505.14418,
-
[4]
A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily
Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y ., Chen, J., and Huang, S. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pp. 2136–2153,
2024
-
[5]
Feng, Y ., Li, Y ., Wu, Y ., Tan, Y ., Guo, Y ., Ding, Y ., Zhai, K., Ma, X., and Jiang, Y . Backdooragent: A unified frame- work for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566,
-
[6]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793,
work page internal anchor Pith review arXiv
-
[7]
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
arXiv preprint arXiv:2409.17458 , year=
9 Preprint Version Jiang, Y ., Aggarwal, K., Laud, T., Munir, K., Pujara, J., and Mukherjee, S. Red queen: Safeguarding large language models against concealed multi-turn jailbreaking.arXiv preprint arXiv:2409.17458,
-
[9]
Backdoorvlm: A benchmark for backdoor attacks on vision-language models
Li, J., Li, Y ., Huang, H., Chen, Y ., Wang, X., Wang, Y ., Ma, X., and Jiang, Y .-G. Backdoorvlm: A benchmark for backdoor attacks on vision-language models.arXiv preprint arXiv:2511.18921, 2025a. Li, T., Bai, C., Xu, K., Chu, C., Zhu, P., and Wang, Z. Skill matters: Dynamic skill learning for multi-agent cooperative reinforcement learning.Neural Network...
-
[10]
Li, Y ., Huang, H., Zhao, Y ., Ma, X., and Sun, J. Backdoor- llm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv preprint arXiv:2408.12798,
-
[11]
M., Huang, H., Ma, X., and Sun, J
Li, Y ., Li, Z., Zhao, W., Min, N. M., Huang, H., Ma, X., and Sun, J. Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025c. Liu, A. Z., Choi, J., Sohn, S., Fu, Y ., Kim, J., Kim, D.-K., Wang, X., Yoo, J., and Lee, H. Skillact: Using skill ab- stractions improves llm agents. InICML 2024 Workshop on LLMs and Cognition,
-
[12]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generat- ing stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451,
work page internal anchor Pith review arXiv
-
[13]
arXiv preprint arXiv:2501.04575 , year=
Liu, Y ., Li, P., Wei, Z., Xie, C., Hu, X., Xu, X., Zhang, S., Han, X., Yang, H., and Wu, F. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. arXiv preprint arXiv:2501.04575,
-
[14]
arXiv preprint arXiv:2504.06821 , year=
Wang, Z. Z., Gandhi, A., Neubig, G., and Fried, D. Inducing programmatic skills for agentic tasks.arXiv preprint arXiv:2504.06821,
-
[15]
Xiang, Z., Jiang, F., Xiong, Z., Ramasubramanian, B., Poovendran, R., and Li, B. Badchain: Backdoor chain- of-thought prompting for large language models.arXiv preprint arXiv:2401.12242,
-
[16]
Advagent: Controllable blackbox red- teaming on web agents.arXiv preprint arXiv:2410.17401,
Xu, C., Kang, M., Zhang, J., Liao, Z., Mo, L., Yuan, M., Sun, H., and Li, B. Advagent: Controllable blackbox red- teaming on web agents.arXiv preprint arXiv:2410.17401,
-
[17]
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Zhang, H., Huang, J., Mei, K., Yao, Y ., Wang, Z., Zhan, C., Wang, H., and Zhang, Y . Agent security bench (asb): For- malizing and benchmarking attacks and defenses in llm- based agents.arXiv preprint arXiv:2410.02644,
work page internal anchor Pith review arXiv
-
[18]
Usb: A comprehensive and unified safety evaluation benchmark for multimodal large language models
Zheng, B., Chen, G., Zhong, H., Teng, Q., Tan, Y ., Liu, Z., Wang, W., Liu, J., Yang, J., Jing, H., et al. Usb: A comprehensive and unified safety evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2505.23793, 2025a. Zheng, B., Fatemi, M. Y ., Jin, X., Wang, Z. Z., Gandhi, A., Song, Y ., Gu, Y ., Srinivasa, J., Liu, G., Neubig,...
-
[19]
Zhu, P., Zhou, Z., Zhang, Y ., Yan, S., Wang, K., and Su, S. Demonagent: Dynamically encrypted multi-backdoor implantation attack on llm-based agent.arXiv preprint arXiv:2502.12575,
-
[20]
AutoDAN: Interpretable Gradient- Based Adversarial Attacks on Large Language Mod- els
Zhu, S., Zhang, R., An, B., Wu, G., Barrow, J., Wang, Z., Huang, F., Nenkova, A., and Sun, T. Autodan: inter- pretable gradient-based adversarial attacks on large lan- guage models.arXiv preprint arXiv:2310.15140,
-
[21]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversar- ial attacks on aligned language models.arXiv preprint arXiv:2307.15043,
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.