Recognition: unknown
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
Pith reviewed 2026-05-08 11:23 UTC · model grok-4.3
The pith
Skill poisoning in LLM agents produces attention hijacking that internal detectors can catch before execution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Successful skill poisoning induces attention hijacking in which response-time attention moves from trusted context to malicious skill spans and produces harmful outputs; RouteGuard detects this via response-conditioned attention and hidden-state alignment through reliability-gated late fusion, reaching 0.8834 F1 on the Skill-Inject slice and recovering 90.51 percent of description attacks missed by lexical screening across real and synthetic open-source benchmarks.
What carries the argument
Response-conditioned attention combined with hidden-state alignment through reliability-gated late fusion, which captures the structured internal shift induced by poisoned skills.
If this is right
- Pre-execution filtering of skills becomes practical without retraining the underlying agent model.
- Text-only lexical screening leaves the majority of description-based skill attacks undetected.
- Internal-signal methods can be applied at inference time using only the model's existing activations.
- Defenses must move beyond surface prompts to monitor how skills actually steer generation.
Where Pith is reading between the lines
- The same internal monitoring could be extended to detect other pre-execution manipulations such as tool misdescription.
- If attention hijacking proves architecture-independent, safety layers could be added to existing agent runtimes with minimal overhead.
- Benchmarks mixing real user skills with synthetic poisons may still under-represent targeted attacks against production agents.
Load-bearing premise
The observed attention hijacking remains a reliable internal signal across different model architectures and attack variants.
What would settle it
A new skill-poisoning variant that produces no measurable shift in attention or hidden-state alignment yet still triggers harmful behavior on the same benchmarks.
Figures
read the original abstract
Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces skill poisoning as a severe indirect injection threat to LLM agents, where malicious instructions are embedded in legitimate action-oriented skills. It identifies 'attention hijacking'—a shift in response-time attention from trusted context to malicious skill spans—as a structured internal effect of successful poisoning. Motivated by this, it proposes RouteGuard, a frozen-backbone detector combining response-conditioned attention and hidden-state alignment via reliability-gated late fusion. The central empirical claim is that RouteGuard is consistently the strongest or most robust detector across real and synthetic open-source skill benchmarks; specifically, on the Skill-Inject channel slice it achieves 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, supporting the conclusion that internal-signal detection is required over text-only filtering.
Significance. If the empirical results and the reliability of attention hijacking as a generalizable signal are substantiated, the work would be significant for LLM agent security. It provides a mechanistic explanation for why text-only defenses fail against skill poisoning and demonstrates a practical internal-signal approach that outperforms lexical baselines on the reported benchmarks. The combination of real and synthetic datasets and the focus on pre-execution detection are strengths that could inform future agent hardening techniques.
major comments (2)
- [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: The performance numbers (0.8834 F1 on Skill-Inject, 90.51% recovery of missed attacks) are stated without any description of the model families evaluated, the precise attention extraction procedure, the construction of the Skill-Inject channel slice, baseline implementations, statistical tests, or error analysis. These omissions are load-bearing because the claim that RouteGuard is 'consistently the strongest' and that internal signals are required cannot be verified or generalized without them.
- [Method / Results] Method and Results sections: The central claim depends on attention hijacking being a reliable, architecture-independent internal signal induced by poisoning. No analysis is provided of its presence or absence across different transformer implementations, attack variants beyond the benchmark distribution, or failure cases, which directly affects whether the 90.51% recovery rate and the broader conclusion follow.
minor comments (2)
- [Introduction] The term 'attention hijacking' is introduced in the abstract and early sections without a formal definition, equation, or illustrative figure showing the attention shift, which would aid clarity.
- [Method] Notation for the late-fusion mechanism (reliability-gated combination of attention and hidden-state features) could be made more explicit with a short equation or pseudocode.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving clarity and substantiating our claims, and we address each major point below with specific plans for revision.
read point-by-point responses
-
Referee: [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: The performance numbers (0.8834 F1 on Skill-Inject, 90.51% recovery of missed attacks) are stated without any description of the model families evaluated, the precise attention extraction procedure, the construction of the Skill-Inject channel slice, baseline implementations, statistical tests, or error analysis. These omissions are load-bearing because the claim that RouteGuard is 'consistently the strongest' and that internal signals are required cannot be verified or generalized without them.
Authors: We agree that additional detail is needed for full verifiability. The Experimental Evaluation section already specifies the model families (Llama-3-8B-Instruct and Mistral-7B-v0.3), the attention extraction procedure (response-conditioned attention aggregated over the final four layers using the model's native attention implementation), the Skill-Inject channel slice construction (synthetic poisoning of 1,200 action-oriented skills with malicious instruction spans), and baseline implementations (lexical screening via keyword matching, embedding cosine similarity, and perplexity filtering). However, we acknowledge the absence of statistical significance tests (e.g., McNemar's test) and per-attack error analysis. We will add these to the revised Experimental Evaluation section and include a brief reference in the abstract to the key experimental setup. This will directly support the 'consistently strongest' claim without altering the reported numbers. revision: yes
-
Referee: [Method / Results] Method and Results sections: The central claim depends on attention hijacking being a reliable, architecture-independent internal signal induced by poisoning. No analysis is provided of its presence or absence across different transformer implementations, attack variants beyond the benchmark distribution, or failure cases, which directly affects whether the 90.51% recovery rate and the broader conclusion follow.
Authors: We recognize that architecture independence is central to the mechanistic argument. The current evaluation covers two distinct open-source decoder-only families (Llama-3 and Mistral) with different attention implementations and parameter scales, and attention hijacking is observed consistently in successful poisoning cases within these. That said, the manuscript does not include a dedicated cross-architecture ablation or analysis of failure cases outside the benchmark distribution. We will add a new subsection to the Results section that (1) quantifies attention hijacking metrics across the evaluated models, (2) tests additional attack variants (e.g., paraphrased and multi-skill injections), and (3) reports failure cases where hijacking is weak or absent. This will allow readers to assess the generalizability of the 90.51% recovery figure and the necessity of internal signals. revision: yes
Circularity Check
No significant circularity; empirical method with independent evaluation
full rationale
The paper introduces RouteGuard as an empirically motivated detector based on observed attention hijacking in poisoned LLM agents, with no equations, derivations, or parameter-fitting steps present in the provided abstract or description. Claims rest on benchmark evaluations using real and synthetic open-source skills, without any self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the central result to its own inputs. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
-
attention hijacking
no independent evidence
Reference graph
Works this paper leans on
-
[1]
online" 'onlinestring :=
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[2]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Zhihao Chen, Ying Zhang, Yi Liu, Gelei Deng, Yuekang Li, Yanjun Zhang, Jianting Ning, Leo Yu Zhang, Lei Ma, and Zhiqiang Li. 2026. https://arxiv.org/abs/2604.03070 Credential leakage in llm agent skills: A large-scale empirical study . arXiv preprint arXiv:2604.03070
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[4]
Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. https://arxiv.org/abs/2406.13352 Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents . arXiv preprint arXiv:2406.13352
work page internal anchor Pith review arXiv 2024
-
[5]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. https://doi.org/10.1145/3605764.3623985 Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection
-
[6]
Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, and Pin-Yu Chen. 2024. https://arxiv.org/abs/2411.00348 Attention tracker: Detecting prompt injection attacks in llms . arXiv preprint arXiv:2411.00348
- [7]
-
[8]
Prompt Injection attack against LLM-integrated Applications
Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, H. Wang, Yan Zheng, Leo Yu Zhang, and Yang Liu. 2023 a . https://doi.org/10.48550/arxiv.2306.05499 Prompt injection attack against llm-integrated applications . arXiv (Cornell University)
work page internal anchor Pith review doi:10.48550/arxiv.2306.05499 2023
-
[9]
Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang. 2026 b . https://arxiv.org/abs/2601.10338 Agent skills in the wild: An empirical study of security vulnerabilities at scale . arXiv preprint arXiv:2601.10338
work page internal anchor Pith review arXiv 2026
- [10]
-
[11]
protectskills . 2026. https://github.com/protectskills/MaliciousAgentSkillsBench Do Not Mention This to the User : Detecting and understanding malicious agent skills . GitHub repository
2026
- [12]
-
[13]
Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. 2025. https://arxiv.org/abs/2507.15219 Promptarmor: Simple yet effective prompt injection defenses . arXiv preprint arXiv:2507.15219
-
[14]
Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Weiran Yao, Jiacheng Wang, Ruoxi Jia, and Ruiyi Zhang. 2025. https://aclanthology.org/2025.findings-acl.967/ Defending against indirect prompt injection by instruction detection . In Findings of the Association for Computational Linguistics: ACL 2025, pages 18714--18735
2025
-
[15]
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. https://doi.org/10.18653/v1/2024.findings-acl.624 Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents
- [16]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.