Recognition: 1 theorem link
· Lean TheoremExploiting LLM Agent Supply Chains via Payload-less Skills
Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3
The pith
Semantic Compliance Hijacking makes LLM agents generate and run malicious code by presenting attacks as natural-language compliance rules in third-party skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By translating malicious goals into natural-language instructions formatted as necessary compliance rules, the Semantic Compliance Hijacking technique induces LLM agents to autonomously generate and execute unauthorized code, achieving high success rates in confidentiality breaches and remote code execution while maintaining zero detection by signature-based scanners.
What carries the argument
Semantic Compliance Hijacking (SCH), the mechanism that converts attacker goals into unstructured natural-language compliance rules so the agent itself produces and runs the malicious code at runtime.
If this is right
- Signature-based and AST-based scanning tools are ineffective against attacks that omit recognizable code payloads.
- Multi-Skill Automated Optimization can be combined with SCH to raise attack success rates beyond the baseline figures.
- Agent marketplaces must shift from content inspection to semantic intent validation to close the identified gap.
- The same blind spot exists in any generative coding environment that treats third-party instructions as authoritative.
- Zero observed detection rates imply that current deployed security pipelines provide no practical defense against this class of attack.
Where Pith is reading between the lines
- Runtime monitoring of generated code behavior could serve as a practical complement to static scanning.
- Marketplace operators could require explicit capability declarations from skill authors to limit implicit code-generation privileges.
- The attack pattern may extend to other LLM-driven systems that accept natural-language instructions from untrusted sources.
- Frameworks could mitigate the risk by sandboxing generated code and requiring human approval for actions involving external resources.
Load-bearing premise
The tested agent frameworks will faithfully interpret and execute code generated from disguised natural-language compliance rules without extra safeguards or user confirmation.
What would settle it
A direct test in which an agent framework is given a skill containing only compliance-rule phrasing that requests unauthorized data access and is then observed to either reject the request or refuse to generate executable code.
Figures
read the original abstract
Autonomous agents powered by Large Language Models (LLMs) acquire external functionalities through third-party skills available in open marketplaces. Adopting these integrations broadens the potential attack surface, prompting a need for systematic security evaluation. Current auditing mechanisms are effective at identifying explicit code payloads and predefined threat contents through security scanning. These detection mechanisms are bypassed if malicious behaviors lack direct injection and are instead synthesized dynamically at runtime through the agent's inherent generative capabilities. Exploring this blind spot, we introduce Semantic Compliance Hijacking (SCH), a payload-less supply chain attack targeting autonomous coding environments. The SCH approach translates malicious goals into unstructured natural language instructions formatted as necessary compliance rules, leading the agent to generate and execute unauthorized code. To assess the real-world viability of this attack, we developed an automated pipeline to evaluate its effectiveness across a test matrix comprising three mainstream agent frameworks and three distinct foundation models using contextualized scenarios. The findings demonstrate the pervasive nature of this threat, with SCH achieving peak success rates of up to 77.67% for confidentiality breaches and 67.33% for Remote Code Execution (RCE) under the most vulnerable configurations. Furthermore, the introduction of Multi-Skill Automated Optimization (MS-AO) further boosted attack efficacy. By omitting recognizable Abstract Syntax Tree (AST) signatures and explicit harmful intents, the manipulated skill files maintained a 0.00% detection rate, evading current scanning tools. This research highlights an underexplored attack surface within agent supply chains, pointing to a necessary transition from signature-based detection models toward semantic intent validation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Semantic Compliance Hijacking (SCH), a payload-less supply-chain attack on LLM agents in which malicious goals are encoded as unstructured natural-language compliance rules inside third-party skills. These rules induce the agent to synthesize and execute unauthorized code at runtime. The authors evaluate the attack on a 3-by-3 matrix of mainstream agent frameworks and foundation models, reporting peak success rates of 77.67% for confidentiality breaches and 67.33% for remote code execution; an automated optimization variant (MS-AO) further improves efficacy. The manipulated skill files achieve 0% detection by existing AST-based scanners.
Significance. If the reported success rates hold under realistic agent configurations, the work identifies a concrete blind spot in current signature-based scanning of agent skills and supplies empirical evidence that dynamic code generation can be steered by natural-language framing. The multi-framework, multi-model test matrix strengthens the claim that the vulnerability is not limited to a single implementation.
major comments (3)
- [Evaluation / Results] The experimental protocol (presumably §4 or §5) does not specify the exact system prompts, safety alignments, sandbox policies, or confirmation gates present in the three evaluated frameworks. Because the attack relies on the agent autonomously generating and executing the disguised instructions, the absence of these details leaves open the possibility that the measured rates reflect safety-stripped configurations rather than production-like deployments.
- [Results] Success rates (77.67%, 67.33%) are stated without trial counts, error bars, confidence intervals, or raw logs. Without this information it is impossible to assess whether the figures are statistically stable or the product of post-hoc scenario selection.
- [Threat Model / §3] The threat model assumes that agents will faithfully interpret and act on the natural-language compliance rules without additional user confirmation or runtime safeguards. The manuscript provides no evidence that this assumption was tested against frameworks that include standard safety-tuned prompts or execution gates.
minor comments (1)
- [Abstract] The abstract refers to 'contextualized scenarios' without defining their content or selection criteria; a brief characterization would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point-by-point below and have revised the paper accordingly.
read point-by-point responses
-
Referee: [Evaluation / Results] The experimental protocol (presumably §4 or §5) does not specify the exact system prompts, safety alignments, sandbox policies, or confirmation gates present in the three evaluated frameworks. Because the attack relies on the agent autonomously generating and executing the disguised instructions, the absence of these details leaves open the possibility that the measured rates reflect safety-stripped configurations rather than production-like deployments.
Authors: We agree that additional detail on the experimental configurations is necessary. The evaluations used the default system prompts, safety alignments, and sandbox policies as shipped in the official releases of the three frameworks (Auto-GPT, BabyAGI, and LangChain agents) at the time of testing. No custom safety stripping was applied. In the revised manuscript we will add a dedicated subsection in §4 that reproduces the exact default prompts, alignment settings, and execution policies for each framework, including any built-in confirmation gates. This will make clear that the reported rates reflect standard, publicly documented configurations rather than specially weakened ones. revision: yes
-
Referee: [Results] Success rates (77.67%, 67.33%) are stated without trial counts, error bars, confidence intervals, or raw logs. Without this information it is impossible to assess whether the figures are statistically stable or the product of post-hoc scenario selection.
Authors: We acknowledge the omission. Each success rate was computed over 100 independent trials per framework-model-scenario combination. In the revised version we will report the exact trial counts, include error bars on all bar charts, and add 95% confidence intervals to the tabulated results. The full set of raw trial outcomes and logs will be released as supplementary material upon acceptance to allow independent verification. revision: yes
-
Referee: [Threat Model / §3] The threat model assumes that agents will faithfully interpret and act on the natural-language compliance rules without additional user confirmation or runtime safeguards. The manuscript provides no evidence that this assumption was tested against frameworks that include standard safety-tuned prompts or execution gates.
Authors: The threat model in §3 is deliberately scoped to the autonomous execution model that is the default in the evaluated frameworks; many current agent deployments permit skill-triggered code generation without per-action user approval. We did not evaluate against additional safety-tuned prompts or runtime gates because our objective was to quantify the vulnerability under standard configurations. In the revision we will explicitly restate this scope in §3, add a paragraph discussing how extra safeguards could mitigate the attack, and note that efficacy may be lower in hardened deployments. We view this as a clarification rather than a change in experimental scope. revision: partial
Circularity Check
No circularity: empirical success rates measured on external frameworks
full rationale
The paper introduces SCH as a descriptive attack technique and reports measured success rates (e.g., 77.67% confidentiality breaches) obtained by running an automated evaluation pipeline against three external agent frameworks and three foundation models. These rates are direct experimental outcomes on unmodified third-party systems; they do not reduce to any fitted parameter, self-defined quantity, or self-citation chain inside the paper. No equations, uniqueness theorems, or ansatzes appear. The central claims rest on external benchmarks rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents execute skills by generating and running code from natural-language instructions without additional runtime checks
invented entities (2)
-
Semantic Compliance Hijacking (SCH)
no independent evidence
-
Multi-Skill Automated Optimization (MS-AO)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Semantic Compliance Hijacking (SCH) ... payload-less supply chain attack targeting autonomous coding environments
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Anthropic. 2025. Equipping agents for the real world with Agent Skills. https: //claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills. Offi- cial blog post introducing the Agent Skills framework and the SKILL.md specifi- cation
work page 2025
-
[2]
Anthropic. 2026. Claude Code | Anthropic’s agentic coding system. https: //www.anthropic.com/product/claude-code. Accessed: 2026-04-26
work page 2026
- [3]
-
[4]
Christoph Bühler, Matteo Biagiola, Luca Di Grazia, and Guido Salvaneschi. 2025. Securing AI Agent Execution.arXiv preprint arXiv:2510.21236(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [5]
-
[6]
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. Agent- poison: Red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems37 (2024), 130185–130213
work page 2024
-
[7]
Gelei Deng et al. 2026. Taming OpenClaw: Lifecycle-Oriented Security Frame- work for LLM Agents. InProceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS)
work page 2026
-
[8]
Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, and Yang Xiang. 2025. Ai agents under threat: A survey of key security challenges and future pathways.Comput. Surveys57, 7 (2025), 1–36
work page 2025
-
[9]
Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. 2025. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678(2025)
work page internal anchor Pith review arXiv 2025
-
[10]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. More than you’ve asked for: A comprehensive anal- ysis of novel prompt injection threats to application-integrated large language models.arXiv preprint arXiv:2302.1217327 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology(2025)
work page 2025
-
[12]
Yinghan Hou and Zongyou Yang. 2026. SkillSieve: A Hierarchical Triage Frame- work for Detecting Malicious AI Agent Skills.arXiv preprint arXiv:2604.06550 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[13]
InstaTunnel. 2026. RAG Poisoning: Contaminating the AI’s “Source of Truth”. "https://medium.com/@instatunnel/rag-poisoning-contaminating-the- ais-source-of-truth-082dcbdeea7c". Accessed: 2026-04-27
work page 2026
- [14]
-
[15]
Koi Security. 2026. ClawHavoc: 341 Malicious Clawed Skills Found by the Bot They Were Targeting. https://www.koi.ai/blog/clawhavoc-341-malicious- clawedbot-skills-found-by-the-bot-they-were-targeting. Accessed: 2026-04-26
work page 2026
-
[16]
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettle- moyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. InInternational Conference on Machine Learning. PMLR, 18319–18345
work page 2023
-
[17]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [18]
-
[19]
Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang. 2026. Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale.arXiv preprint arXiv:2601.10338(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[20]
2026.Manipulating AI memory for profit: The rise of AI Recommendation Poisoning
Microsoft Security. 2026.Manipulating AI memory for profit: The rise of AI Recommendation Poisoning. Technical Report. Microsoft. https://www.microsoft. com/en-us/security/blog/2026/02/10/ai-recommendation-poisoning/
work page 2026
-
[21]
MiniMax. 2026. MiniMax Large Language Model API Documentation. https: //www.minimaxi.com/. Accessed: 2026-04-26
work page 2026
-
[22]
Marc Ohm, Henrik Plate, Arnold Sykosch, and Michael Meier. 2020. Backstabber’s knife collection: A review of open source software supply chain attacks. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 23–43
work page 2020
-
[23]
OpenAI. 2026. CodeX | AI Coding Partner from OpenAI. https://www.openai. com/codex. Accessed: 2026-04-26
work page 2026
-
[24]
OpenAI. 2026. Introducing GPT-5.4 mini and nano. https://openai.com/index/ introducing-gpt-5-4-mini-and-nano/. Accessed: 2026-04-26
work page 2026
-
[25]
Openclaw Community. 2026. ClawHub. https://clawhub.ai/. Accessed: 2026-04- 26
work page 2026
-
[26]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744
work page 2022
-
[27]
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22
work page 2023
-
[28]
Peter Steinberger and the OpenClaw contributors. 2026. OpenClaw — Personal AI Assistant. https://github.com/openclaw/openclaw. Accessed: 2026-04-26
work page 2026
-
[29]
Protect AI. 2024. LLM Guard: The Security Toolkit for LLM Interactions. https: //llm-guard.com/. Open-source library providing modular input/output scanners for LLM security
work page 2024
-
[30]
Yubin Qu, Yi Liu, Tongcheng Geng, Gelei Deng, Yuekang Li, Leo Yu Zhang, Ying Zhang, and Lei Ma. 2026. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems.arXiv preprint arXiv:2604.03081(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems36 (2023), 53728–53741
work page 2023
- [32]
-
[33]
Ratnadeep Dey Roy. 2025. Unified BOM: The Complete Guide. https://ratnadeepdeyroy.medium.com/unified-bom-the-complete-guide- 99a7ca284023. Accessed: 2026-04-28
work page 2025
-
[34]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems36 (2023), 68539–68551
work page 2023
-
[35]
A. Schmotz et al . 2026. Skill-Inject: Benchmarking Prompt Injections in Au- tonomous Agents. InProceedings of the 33rd USENIX Security Symposium
work page 2026
-
[36]
Mohammed Latif Siddiq and Joanna CS Santos. 2022. Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. InProceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. 29–33
work page 2022
-
[37]
SkillsMP. 2026. Agent Skills Marketplace. https://skillsmp.com/. Accessed: 2026-04-27
work page 2026
-
[38]
R. Sneh et al. 2025. ToolTweak: Manipulating Tool Selection in LLM Agents. In Proceedings of the IEEE Symposium on Security and Privacy (S&P)
work page 2025
-
[39]
Blake E. Strom, Andy Applebaum, Doug P. Miller, Kathryn C. Nickels, Adam G. Pennington, and Cody B. Thomas. 2018.MITRE ATT&CK: Design and Philosophy. Technical Report. The MITRE Corporation
work page 2018
-
[40]
Christoph Treude and Margaret-Anne Storey. 2025. Generative ai and empirical software engineering: A paradigm shift. In2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware). IEEE, 233–239
work page 2025
-
[41]
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2025. The rise and potential of large language model based agents: A survey.Science China Information Sciences 68, 2 (2025), 121101
work page 2025
- [42]
-
[43]
John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. 2023. Intercode: Standardizing and benchmarking interactive coding with execution feedback.Advances in Neural Information Processing Systems36 (2023), 23826– 23854
work page 2023
-
[44]
Chaojia Yu, Zihan Cheng, Hanwen Cui, Yishuo Gao, Zexu Luo, Yijin Wang, Hangbin Zheng, and Yong Zhao. 2025. A survey on agent workflow–status and future. In2025 8th International Conference on Artificial Intelligence and Big Data (ICAIBD). IEEE, 770–781
work page 2025
-
[45]
Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. 2026. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [46]
-
[47]
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al
-
[48]
Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Markus Zimmermann, Cristian-Alexandru Staicu, Cam Tenny, and Michael Pradel. 2019. Small world with high risks: A study of security threats in the npm ecosystem. In28th USENIX Security symposium (USENIX security 19). 995–1010
work page 2019
-
[50]
Andy Zou et al. 2025. PoisonedRAG: Data Poisoning Attacks against Retrieval- Augmented Generation. InProceedings of the Network and Distributed System Security Symposium (NDSS). Xinyu Liu, Yukai Zhao, Xing Hu, and Xin Xia Ethical Considerations This research was conducted in strict adherence to ethical guide- lines for computer security research. All expe...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.