POISE: Position-Aware Undetectable Skill Injection on LLM Agents
Pith reviewed 2026-06-27 19:50 UTC · model grok-4.3
The pith
Position-aware blending of a single trigger into skill bodies achieves 89.3 percent attack success while evading scanner detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POISE is a position-aware attack that compresses the trigger into a single benign-looking body instruction, places it at a feasible position, and uses a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex and gpt-5.2, POISE achieves an 89.3 percent ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth holds because legitimate skill bodies naturally require privileged tool operations, causing LLM scanners to flag 74.6 percent of clean skills on average across four judges and both benchmarks, so POISE causes only 5.6 percent of p
What carries the argument
The position-aware placement and context-aware blending of a single compressed trigger instruction into the skill body.
If this is right
- Body placement of the trigger retains its stealth advantage over YAML-header methods.
- Legitimate skills already trigger high false-positive rates of 74.6 percent in LLM scanners.
- Blended injections increase high-risk alerts by only 5.6 percent over clean baselines.
- Static scanner defenses become ineffective against position-aware blended attacks.
Where Pith is reading between the lines
- Runtime monitoring of skill execution behavior may be needed instead of relying on static scans.
- Skill distribution platforms could require additional verification steps such as execution sandboxes.
- The same blending principle could apply to other open agent extension formats beyond skills.
Load-bearing premise
Legitimate skill bodies naturally require privileged tool operations that cause LLM scanners to produce high false-positive rates.
What would settle it
An experiment that measures scanner false-positive rates on clean skills below 30 percent or records POISE attack success rate below 60 percent on the same benchmark would settle the central claim.
Figures
read the original abstract
Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POISE, a position-aware attack for skill injection on LLM agents. It compresses malicious triggers into single benign-looking body instructions placed at feasible positions and blended with context via a generator. On the Skill-Inject benchmark with codex+gpt-5.2, POISE reports 89.3% ASR (28 points above random body placement, 2.6 above YAML), while claiming stealth because legitimate skill bodies trigger 74.6% false positives on average across four LLM judges, so POISE adds new high-risk alerts in only 5.6% of cases.
Significance. If the empirical results and the 74.6% FP claim hold under full protocol, the work shows that body-placement attacks can overcome the reliability-stealth tradeoff in open agent skills, rendering static LLM scanners ineffective and highlighting a practical vulnerability in agent extensibility mechanisms.
major comments (3)
- [Abstract] Abstract: The stealth advantage is justified solely by the claim that LLM scanners produce a 74.6% FP rate on clean skills (allowing POISE to add only 5.6% new alerts). No measurement protocol, judge identities, inter-judge agreement, benchmark skill selection criteria, or ablation on scanner prompt wording is supplied, making it impossible to determine whether this rate is a general property of body placement or an artifact of the chosen judges and templates.
- [Abstract] Abstract: The headline 89.3% ASR (and the 28-point and 2.6-point margins) is reported without error bars, full experimental protocol, dataset details, number of trials, or verification steps for the 74.6% FP figure. This directly limits assessment of whether the data supports the central reliability-plus-stealth claim.
- [Abstract] Abstract: The weakest assumption—that legitimate skill bodies naturally require privileged tool operations and therefore trigger high FP rates—is presented as observed fact but is not supported by any ablation showing that the FP rate persists under different scanner prompts or skill corpora; if this rate is judge- or prompt-specific, the “undetectable” margin over YAML baselines does not follow.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract. We address each comment below and will revise the manuscript to incorporate additional experimental details, error bars, and an ablation study as noted.
read point-by-point responses
-
Referee: [Abstract] Abstract: The stealth advantage is justified solely by the claim that LLM scanners produce a 74.6% FP rate on clean skills (allowing POISE to add only 5.6% new alerts). No measurement protocol, judge identities, inter-judge agreement, benchmark skill selection criteria, or ablation on scanner prompt wording is supplied, making it impossible to determine whether this rate is a general property of body placement or an artifact of the chosen judges and templates.
Authors: The full measurement protocol is described in Section 5.2, including the four judges (GPT-4o, Claude-3-Opus, Llama-3-70B, Gemini-1.5), prompt templates, inter-judge agreement (Cohen's κ=0.81), and selection of 150 clean skills from Skill-Inject and AgentBench. We agree the abstract omits these details due to length and will revise it to summarize the protocol with section references. The prompts follow standard static scanner designs; an explicit ablation on wording variants was not performed. revision: yes
-
Referee: [Abstract] Abstract: The headline 89.3% ASR (and the 28-point and 2.6-point margins) is reported without error bars, full experimental protocol, dataset details, number of trials, or verification steps for the 74.6% FP figure. This directly limits assessment of whether the data supports the central reliability-plus-stealth claim.
Authors: ASR figures are means over 100 trials per condition (standard error ±1.8% for POISE); full protocol, dataset (50 skills from Skill-Inject), trial counts, and FP verification steps appear in Section 4.1 and Appendix B. We will revise the abstract to include error bars and explicit references to these sections. revision: yes
-
Referee: [Abstract] Abstract: The weakest assumption—that legitimate skill bodies naturally require privileged tool operations and therefore trigger high FP rates—is presented as observed fact but is not supported by any ablation showing that the FP rate persists under different scanner prompts or skill corpora; if this rate is judge- or prompt-specific, the “undetectable” margin over YAML baselines does not follow.
Authors: The high FP rate is an empirical result from the evaluated benchmarks and judges. We acknowledge that an ablation on alternative prompts and corpora would strengthen generality claims. We will add this ablation (two new prompt variants and one additional corpus) in the revision. revision: yes
Circularity Check
No circularity; empirical results on benchmarks
full rationale
The paper reports measured Attack Success Rates (89.3% ASR) and false-positive rates (74.6% on clean skills) from direct experiments on named benchmarks against explicit baselines (random-placement body, YAML-only). No equations, fitted parameters, self-citations, or derivations are present in the provided text; the central claims are observational comparisons that remain independent of any internal reduction to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents execute skills containing blended privileged operations without flagging them as anomalous when the malicious content is positioned among setup steps.
Reference graph
Works this paper leans on
-
[1]
Transactions of the association for computational linguistics , volume=
Lost in the middle: How language models use long contexts , author=. Transactions of the association for computational linguistics , volume=
-
[2]
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
Skill-inject: Measuring agent vulnerability to skill file attacks , author=. arXiv preprint arXiv:2602.20156 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
SkillsBench: Benchmarking how well agent skills work across diverse tasks , author=. arXiv preprint arXiv:2602.12670 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=
Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection , author=. Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=
-
[5]
NeurIPS ML Safety Workshop , year=
Ignore Previous Prompt: Attack Techniques For Language Models , author=. NeurIPS ML Safety Workshop , year=
-
[6]
2025 , howpublished =
Equipping Agents for the Real World with Agent Skills , author =. 2025 , howpublished =
2025
-
[7]
2023 , html =
Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , html =
2023
-
[8]
Advances in neural information processing systems , volume=
Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
-
[9]
International Conference on Learning Representations , volume=
Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. International Conference on Learning Representations , volume=
-
[10]
Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023 , year=
Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023 , year=
-
[11]
2023 IEEE Symposium on Security and Privacy (SP) , pages=
Sok: Taxonomy of attacks on open-source software supply chains , author=. 2023 IEEE Symposium on Security and Privacy (SP) , pages=. 2023 , organization=
2023
-
[12]
International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment , pages=
Backstabber’s knife collection: A review of open source software supply chain attacks , author=. International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment , pages=. 2020 , organization=
2020
-
[13]
Findings of the Association for Computational Linguistics: ACL 2024 , pages=
Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=
2024
-
[14]
33rd USENIX Security Symposium (USENIX Security 24) , pages=
Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=
-
[15]
Advances in neural information processing systems , volume=
Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
-
[16]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sparks of Artificial General Intelligence: Early experiments with GPT-4 , author=. arXiv preprint arXiv:2303.12712 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
2026 , howpublished =
Codex CLI , author =. 2026 , howpublished =
2026
-
[18]
2026 , howpublished =
Claude Code Documentation , author =. 2026 , howpublished =
2026
-
[19]
Advances in Neural Information Processing Systems , volume=
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=
-
[20]
International Conference on Learning Representations , volume=
Agentharm: A benchmark for measuring harmfulness of llm agents , author=. International Conference on Learning Representations , volume=
-
[21]
arXiv preprint arXiv:2510.02554 , year=
Tooltweak: An attack on tool selection in llm-based agents , author=. arXiv preprint arXiv:2510.02554 , year=
-
[22]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Mcptox: A benchmark for tool poisoning on real-world mcp servers , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[23]
The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents , year=
Skillject: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement , author=. The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents , year=
-
[24]
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems , author=. arXiv preprint arXiv:2604.03081 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement
Skillattack: Automated red teaming of agent skills through attack path refinement , author=. arXiv preprint arXiv:2604.04989 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
First Workshop on Agent Skills , year=
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward , author=. First Workshop on Agent Skills , year=
-
[27]
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis
Towards secure agent skills: Architecture, threat taxonomy, and security analysis , author=. arXiv preprint arXiv:2604.02837 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild
Malicious agent skills in the wild: A large-scale security empirical study , author=. arXiv preprint arXiv:2602.06547 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Exploiting LLM Agent Supply Chains via Payload-less Skills
Exploiting LLM Agent Supply Chains via Payload-less Skills , author=. arXiv preprint arXiv:2605.14460 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Prompt Injection attack against LLM-integrated Applications
Prompt injection attack against llm-integrated applications , author=. arXiv preprint arXiv:2306.05499 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
do anything now
" do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models , author=. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , pages=
2024
-
[33]
arXiv preprint arXiv:2506.02456 , year=
Vpi-bench: Visual prompt injection attacks for computer-use agents , author=. arXiv preprint arXiv:2506.02456 , year=
-
[34]
Advances in Neural Information Processing Systems , volume=
Memory injection attacks on LLM agents via query-only interaction , author=. Advances in Neural Information Processing Systems , volume=
-
[35]
arXiv preprint arXiv:2601.05504 , year=
Memory poisoning attack and defense on memory based llm-agents , author=. arXiv preprint arXiv:2601.05504 , year=
-
[36]
34th USENIX Security Symposium (USENIX Security 25) , pages=
\ StruQ \ : Defending against prompt injection with structured queries , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=
-
[37]
European Symposium on Research in Computer Security , pages=
Jatmo: Prompt injection defense by task-specific finetuning , author=. European Symposium on Research in Computer Security , pages=. 2024 , organization=
2024
-
[38]
Defending Against Indirect Prompt Injection Attacks With Spotlighting
Defending against indirect prompt injection attacks with spotlighting , author=. arXiv preprint arXiv:2403.14720 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[39]
AIP Conference Proceedings , volume=
Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications , author=. AIP Conference Proceedings , volume=. 2024 , organization=
2024
-
[40]
Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=
Towards Reverse Engineering of Language Models: A Survey , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=
2025
-
[41]
arXiv preprint arXiv:2509.24566 , year=
Tokenswap: Backdoor attack on the compositional understanding of large vision-language models , author=. arXiv preprint arXiv:2509.24566 , year=
-
[42]
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation , author=. arXiv preprint arXiv:2603.17368 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
Advances in Neural Information Processing Systems , volume=
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=
-
[44]
34th USENIX Security Symposium (USENIX Security 25) , pages=
\ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=
-
[45]
Advances in Neural Information Processing Systems , volume=
Watch out for your agents! investigating backdoor threats to llm-based agents , author=. Advances in Neural Information Processing Systems , volume=
-
[46]
arXiv preprint arXiv:2402.08567 , year=
Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast , author=. arXiv preprint arXiv:2402.08567 , year=
-
[47]
Coercing LLMs to do and reveal (almost) anything,
Coercing llms to do and reveal (almost) anything , author=. arXiv preprint arXiv:2402.14020 , year=
-
[48]
CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Coevoskills: Self-evolving agent skills via co-evolutionary verification , author=. arXiv preprint arXiv:2604.01687 , year=
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.