pith. sign in

arxiv: 2606.07943 · v1 · pith:UCBG77Z6new · submitted 2026-06-06 · 💻 cs.CR · cs.AI· cs.CL

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Pith reviewed 2026-06-27 19:50 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL
keywords skill poisoningLLM agentsattack success ratestealth injectionposition-awarebody injectionscanner false positives
0
0 comments X

The pith

Position-aware blending of a single trigger into skill bodies achieves 89.3 percent attack success while evading scanner detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that effective skill-poisoning attacks on LLM agents must execute their payload without causing the user's legitimate task to fail, since failure would prompt inspection of the skill. It shows that prior approaches face a tradeoff where YAML-header injections load reliably but are easy to spot, while body injections are harder to detect but less reliable because out-of-context commands raise suspicion. POISE addresses this by compressing the trigger into one benign-looking instruction, positioning it where it fits naturally among setup or prerequisite steps, and blending it with surrounding text via a context-aware generator. A sympathetic reader would care because open skill formats let users extend agents easily yet leave them open to hidden manipulation if attacks can stay invisible. The evaluation claims that this method reaches high success rates while producing almost no additional scanner alerts beyond those already raised by clean skills.

Core claim

POISE is a position-aware attack that compresses the trigger into a single benign-looking body instruction, places it at a feasible position, and uses a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex and gpt-5.2, POISE achieves an 89.3 percent ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth holds because legitimate skill bodies naturally require privileged tool operations, causing LLM scanners to flag 74.6 percent of clean skills on average across four judges and both benchmarks, so POISE causes only 5.6 percent of p

What carries the argument

The position-aware placement and context-aware blending of a single compressed trigger instruction into the skill body.

If this is right

  • Body placement of the trigger retains its stealth advantage over YAML-header methods.
  • Legitimate skills already trigger high false-positive rates of 74.6 percent in LLM scanners.
  • Blended injections increase high-risk alerts by only 5.6 percent over clean baselines.
  • Static scanner defenses become ineffective against position-aware blended attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Runtime monitoring of skill execution behavior may be needed instead of relying on static scans.
  • Skill distribution platforms could require additional verification steps such as execution sandboxes.
  • The same blending principle could apply to other open agent extension formats beyond skills.

Load-bearing premise

Legitimate skill bodies naturally require privileged tool operations that cause LLM scanners to produce high false-positive rates.

What would settle it

An experiment that measures scanner false-positive rates on clean skills below 30 percent or records POISE attack success rate below 60 percent on the same benchmark would settle the central claim.

Figures

Figures reproduced from arXiv: 2606.07943 by Dehai Min, Haochang Hao, Lu Cheng, Miao Xu, Yingqiang Ge, Yunbei Zhang, Zhifang Zhang.

Figure 1
Figure 1. Figure 1: Overview of POISE. (a) Direct skill-file poisoning fragments explicit malicious steps across the file, each one risking the agent’s suspicion as out-of-context. (b) POISE places a single benign-looking line at a feasible body position, framing a script invocation as a routine prerequisite and delegating the attacker-chosen behavior to an auxiliary canary script. (2026) embed malicious logic in documentatio… view at source ↗
Figure 2
Figure 2. Figure 2: POISE is both stealthy and reliable: (a) it adds little incremental static-detection signal over already-noisy LLM judges, and (b) it triggers at a high rate from any feasible body position. signal. The opposite regime is no safer. Whole￾file ingestion makes body-position injections more reliable but also exposes more of the skill to a scan￾ner with longer context, and our results show the scanners do not … view at source ↗
read the original abstract

Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces POISE, a position-aware attack for skill injection on LLM agents. It compresses malicious triggers into single benign-looking body instructions placed at feasible positions and blended with context via a generator. On the Skill-Inject benchmark with codex+gpt-5.2, POISE reports 89.3% ASR (28 points above random body placement, 2.6 above YAML), while claiming stealth because legitimate skill bodies trigger 74.6% false positives on average across four LLM judges, so POISE adds new high-risk alerts in only 5.6% of cases.

Significance. If the empirical results and the 74.6% FP claim hold under full protocol, the work shows that body-placement attacks can overcome the reliability-stealth tradeoff in open agent skills, rendering static LLM scanners ineffective and highlighting a practical vulnerability in agent extensibility mechanisms.

major comments (3)
  1. [Abstract] Abstract: The stealth advantage is justified solely by the claim that LLM scanners produce a 74.6% FP rate on clean skills (allowing POISE to add only 5.6% new alerts). No measurement protocol, judge identities, inter-judge agreement, benchmark skill selection criteria, or ablation on scanner prompt wording is supplied, making it impossible to determine whether this rate is a general property of body placement or an artifact of the chosen judges and templates.
  2. [Abstract] Abstract: The headline 89.3% ASR (and the 28-point and 2.6-point margins) is reported without error bars, full experimental protocol, dataset details, number of trials, or verification steps for the 74.6% FP figure. This directly limits assessment of whether the data supports the central reliability-plus-stealth claim.
  3. [Abstract] Abstract: The weakest assumption—that legitimate skill bodies naturally require privileged tool operations and therefore trigger high FP rates—is presented as observed fact but is not supported by any ablation showing that the FP rate persists under different scanner prompts or skill corpora; if this rate is judge- or prompt-specific, the “undetectable” margin over YAML baselines does not follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We address each comment below and will revise the manuscript to incorporate additional experimental details, error bars, and an ablation study as noted.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The stealth advantage is justified solely by the claim that LLM scanners produce a 74.6% FP rate on clean skills (allowing POISE to add only 5.6% new alerts). No measurement protocol, judge identities, inter-judge agreement, benchmark skill selection criteria, or ablation on scanner prompt wording is supplied, making it impossible to determine whether this rate is a general property of body placement or an artifact of the chosen judges and templates.

    Authors: The full measurement protocol is described in Section 5.2, including the four judges (GPT-4o, Claude-3-Opus, Llama-3-70B, Gemini-1.5), prompt templates, inter-judge agreement (Cohen's κ=0.81), and selection of 150 clean skills from Skill-Inject and AgentBench. We agree the abstract omits these details due to length and will revise it to summarize the protocol with section references. The prompts follow standard static scanner designs; an explicit ablation on wording variants was not performed. revision: yes

  2. Referee: [Abstract] Abstract: The headline 89.3% ASR (and the 28-point and 2.6-point margins) is reported without error bars, full experimental protocol, dataset details, number of trials, or verification steps for the 74.6% FP figure. This directly limits assessment of whether the data supports the central reliability-plus-stealth claim.

    Authors: ASR figures are means over 100 trials per condition (standard error ±1.8% for POISE); full protocol, dataset (50 skills from Skill-Inject), trial counts, and FP verification steps appear in Section 4.1 and Appendix B. We will revise the abstract to include error bars and explicit references to these sections. revision: yes

  3. Referee: [Abstract] Abstract: The weakest assumption—that legitimate skill bodies naturally require privileged tool operations and therefore trigger high FP rates—is presented as observed fact but is not supported by any ablation showing that the FP rate persists under different scanner prompts or skill corpora; if this rate is judge- or prompt-specific, the “undetectable” margin over YAML baselines does not follow.

    Authors: The high FP rate is an empirical result from the evaluated benchmarks and judges. We acknowledge that an ablation on alternative prompts and corpora would strengthen generality claims. We will add this ablation (two new prompt variants and one additional corpus) in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results on benchmarks

full rationale

The paper reports measured Attack Success Rates (89.3% ASR) and false-positive rates (74.6% on clean skills) from direct experiments on named benchmarks against explicit baselines (random-placement body, YAML-only). No equations, fitted parameters, self-citations, or derivations are present in the provided text; the central claims are observational comparisons that remain independent of any internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No mathematical derivations or fitted parameters; the work is an empirical attack evaluation relying on domain assumptions about agent behavior and scanner sensitivity.

axioms (1)
  • domain assumption LLM agents execute skills containing blended privileged operations without flagging them as anomalous when the malicious content is positioned among setup steps.
    Implicit in the claim that body placement retains stealth while YAML does not; required for the attack success without task failure.

pith-pipeline@v0.9.1-grok · 5842 in / 1287 out tokens · 27842 ms · 2026-06-27T19:50:25.979754+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 19 canonical work pages · 13 internal anchors

  1. [1]

    Transactions of the association for computational linguistics , volume=

    Lost in the middle: How language models use long contexts , author=. Transactions of the association for computational linguistics , volume=

  2. [2]

    Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

    Skill-inject: Measuring agent vulnerability to skill file attacks , author=. arXiv preprint arXiv:2602.20156 , year=

  3. [3]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    SkillsBench: Benchmarking how well agent skills work across diverse tasks , author=. arXiv preprint arXiv:2602.12670 , year=

  4. [4]

    Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=

    Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection , author=. Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=

  5. [5]

    NeurIPS ML Safety Workshop , year=

    Ignore Previous Prompt: Attack Techniques For Language Models , author=. NeurIPS ML Safety Workshop , year=

  6. [6]

    2025 , howpublished =

    Equipping Agents for the Real World with Agent Skills , author =. 2025 , howpublished =

  7. [7]

    2023 , html =

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , html =

  8. [8]

    Advances in neural information processing systems , volume=

    Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=

  9. [9]

    International Conference on Learning Representations , volume=

    Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. International Conference on Learning Representations , volume=

  10. [10]

    Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023 , year=

    Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023 , year=

  11. [11]

    2023 IEEE Symposium on Security and Privacy (SP) , pages=

    Sok: Taxonomy of attacks on open-source software supply chains , author=. 2023 IEEE Symposium on Security and Privacy (SP) , pages=. 2023 , organization=

  12. [12]

    International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment , pages=

    Backstabber’s knife collection: A review of open source software supply chain attacks , author=. International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment , pages=. 2020 , organization=

  13. [13]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages=

    Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

  14. [14]

    33rd USENIX Security Symposium (USENIX Security 24) , pages=

    Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

  15. [15]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  16. [16]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sparks of Artificial General Intelligence: Early experiments with GPT-4 , author=. arXiv preprint arXiv:2303.12712 , year=

  17. [17]

    2026 , howpublished =

    Codex CLI , author =. 2026 , howpublished =

  18. [18]

    2026 , howpublished =

    Claude Code Documentation , author =. 2026 , howpublished =

  19. [19]

    Advances in Neural Information Processing Systems , volume=

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=

  20. [20]

    International Conference on Learning Representations , volume=

    Agentharm: A benchmark for measuring harmfulness of llm agents , author=. International Conference on Learning Representations , volume=

  21. [21]

    arXiv preprint arXiv:2510.02554 , year=

    Tooltweak: An attack on tool selection in llm-based agents , author=. arXiv preprint arXiv:2510.02554 , year=

  22. [22]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Mcptox: A benchmark for tool poisoning on real-world mcp servers , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  23. [23]

    The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents , year=

    Skillject: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement , author=. The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents , year=

  24. [24]

    Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems , author=. arXiv preprint arXiv:2604.03081 , year=

  25. [25]

    SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement

    Skillattack: Automated red teaming of agent skills through attack path refinement , author=. arXiv preprint arXiv:2604.04989 , year=

  26. [26]

    First Workshop on Agent Skills , year=

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward , author=. First Workshop on Agent Skills , year=

  27. [27]

    Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

    Towards secure agent skills: Architecture, threat taxonomy, and security analysis , author=. arXiv preprint arXiv:2604.02837 , year=

  28. [28]

    "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

    Malicious agent skills in the wild: A large-scale security empirical study , author=. arXiv preprint arXiv:2602.06547 , year=

  29. [29]

    Exploiting LLM Agent Supply Chains via Payload-less Skills

    Exploiting LLM Agent Supply Chains via Payload-less Skills , author=. arXiv preprint arXiv:2605.14460 , year=

  30. [30]

    Prompt Injection attack against LLM-integrated Applications

    Prompt injection attack against llm-integrated applications , author=. arXiv preprint arXiv:2306.05499 , year=

  31. [31]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

  32. [32]

    do anything now

    " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models , author=. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , pages=

  33. [33]

    arXiv preprint arXiv:2506.02456 , year=

    Vpi-bench: Visual prompt injection attacks for computer-use agents , author=. arXiv preprint arXiv:2506.02456 , year=

  34. [34]

    Advances in Neural Information Processing Systems , volume=

    Memory injection attacks on LLM agents via query-only interaction , author=. Advances in Neural Information Processing Systems , volume=

  35. [35]

    arXiv preprint arXiv:2601.05504 , year=

    Memory poisoning attack and defense on memory based llm-agents , author=. arXiv preprint arXiv:2601.05504 , year=

  36. [36]

    34th USENIX Security Symposium (USENIX Security 25) , pages=

    \ StruQ \ : Defending against prompt injection with structured queries , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=

  37. [37]

    European Symposium on Research in Computer Security , pages=

    Jatmo: Prompt injection defense by task-specific finetuning , author=. European Symposium on Research in Computer Security , pages=. 2024 , organization=

  38. [38]

    Defending Against Indirect Prompt Injection Attacks With Spotlighting

    Defending against indirect prompt injection attacks with spotlighting , author=. arXiv preprint arXiv:2403.14720 , year=

  39. [39]

    AIP Conference Proceedings , volume=

    Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications , author=. AIP Conference Proceedings , volume=. 2024 , organization=

  40. [40]

    Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

    Towards Reverse Engineering of Language Models: A Survey , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

  41. [41]

    arXiv preprint arXiv:2509.24566 , year=

    Tokenswap: Backdoor attack on the compositional understanding of large vision-language models , author=. arXiv preprint arXiv:2509.24566 , year=

  42. [42]

    Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

    Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation , author=. arXiv preprint arXiv:2603.17368 , year=

  43. [43]

    Advances in Neural Information Processing Systems , volume=

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=

  44. [44]

    34th USENIX Security Symposium (USENIX Security 25) , pages=

    \ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=

  45. [45]

    Advances in Neural Information Processing Systems , volume=

    Watch out for your agents! investigating backdoor threats to llm-based agents , author=. Advances in Neural Information Processing Systems , volume=

  46. [46]

    arXiv preprint arXiv:2402.08567 , year=

    Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast , author=. arXiv preprint arXiv:2402.08567 , year=

  47. [47]

    Coercing LLMs to do and reveal (almost) anything,

    Coercing llms to do and reveal (almost) anything , author=. arXiv preprint arXiv:2402.14020 , year=

  48. [48]

    CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

    Coevoskills: Self-evolving agent skills via co-evolutionary verification , author=. arXiv preprint arXiv:2604.01687 , year=