arxiv: 2604.22888 · v1 · submitted 2026-04-24 · 💻 cs.CR · cs.AI

Recognition: unknown

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao , Xuehai Tang , Biyu Zhou , Songlin Hu , Jizhong Han

Authors on Pith no claims yet

Pith reviewed 2026-05-08 11:23 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords skill poisoningLLM agentsattention hijackinginternal signal detectionpre-execution defenseindirect injectionagent security

0 comments

The pith

Skill poisoning in LLM agents produces attention hijacking that internal detectors can catch before execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hiding malicious instructions inside legitimate-looking agent skills creates a structured internal effect: the model's attention during response generation shifts away from trusted context toward the poisoned skill content. This shift, termed attention hijacking, can be measured by combining response-conditioned attention patterns with hidden-state alignment checks. RouteGuard implements this as a frozen-backbone system using reliability-gated late fusion of those signals. When the claim holds, agents gain pre-execution defense against a threat that bypasses ordinary text filters, as shown by strong results on both real and synthetic skill benchmarks where lexical methods miss most description attacks.

Core claim

Successful skill poisoning induces attention hijacking in which response-time attention moves from trusted context to malicious skill spans and produces harmful outputs; RouteGuard detects this via response-conditioned attention and hidden-state alignment through reliability-gated late fusion, reaching 0.8834 F1 on the Skill-Inject slice and recovering 90.51 percent of description attacks missed by lexical screening across real and synthetic open-source benchmarks.

What carries the argument

Response-conditioned attention combined with hidden-state alignment through reliability-gated late fusion, which captures the structured internal shift induced by poisoned skills.

If this is right

Pre-execution filtering of skills becomes practical without retraining the underlying agent model.
Text-only lexical screening leaves the majority of description-based skill attacks undetected.
Internal-signal methods can be applied at inference time using only the model's existing activations.
Defenses must move beyond surface prompts to monitor how skills actually steer generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same internal monitoring could be extended to detect other pre-execution manipulations such as tool misdescription.
If attention hijacking proves architecture-independent, safety layers could be added to existing agent runtimes with minimal overhead.
Benchmarks mixing real user skills with synthetic poisons may still under-represent targeted attacks against production agents.

Load-bearing premise

The observed attention hijacking remains a reliable internal signal across different model architectures and attack variants.

What would settle it

A new skill-poisoning variant that produces no measurable shift in attention or hidden-state alignment yet still triggers harmful behavior on the same benchmarks.

Figures

Figures reproduced from arXiv: 2604.22888 by Biyu Zhou, Jizhong Han, Songlin Hu, Wenjie Xiao, Xuehai Tang.

**Figure 1.** Figure 1: Boundary contrast between skill poisoning view at source ↗

**Figure 2.** Figure 2: Attention hijacking in skill poisoning. In the poisoned condition, response-time attention shifts away from view at source ↗

**Figure 3.** Figure 3: RouteGuard-v2 architecture. A frozen backbone observes view at source ↗

read the original abstract

Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RouteGuard shows internal attention signals can catch skill poisoning missed by text filters, but the abstract leaves generalizability and experimental details too thin to trust the numbers yet.

read the letter

The main point is that this paper claims attention hijacking inside the model is a reliable sign of poisoned skills, and their RouteGuard detector using response-conditioned attention plus hidden-state alignment beats lexical screening on the benchmarks they ran. On the Skill-Inject slice it hits 0.8834 F1 and pulls back 90.51 percent of the description attacks that text filters miss. That is the concrete result worth noting first. It suggests that for dense, action-oriented skills, watching what the model actually attends to at inference time adds something text-only methods do not. The work is honest about why skills are a harder injection surface than classic prompts: they already look like legitimate instructions. The detector is built on a frozen backbone with late fusion, which keeps it practical. Those are the parts that land. The soft spots are mostly about missing pieces rather than outright contradictions. The abstract gives performance numbers but no architecture list, no exact attention extraction code, no error bars, and no clear statement of how many attack variants were tried beyond the channel slice. If the hijacking pattern turns out to be tied to particular transformer implementations or fails on skills outside their real-plus-synthetic set, the claim that internal signals are required does not follow as strongly. The benchmarks are a reasonable starting point, but without more on selection or distribution shift the recovery rate could be optimistic. This paper is for people working on agent security tooling and AI safety pipelines. A reader who needs ideas for monitoring internal states would find the mechanism and the fusion approach useful even before full replication. It deserves peer review. The core observation is specific enough and the practical gap is real enough that referees should see the full methods and any additional controls on generalizability.

Referee Report

2 major / 2 minor

Summary. The paper introduces skill poisoning as a severe indirect injection threat to LLM agents, where malicious instructions are embedded in legitimate action-oriented skills. It identifies 'attention hijacking'—a shift in response-time attention from trusted context to malicious skill spans—as a structured internal effect of successful poisoning. Motivated by this, it proposes RouteGuard, a frozen-backbone detector combining response-conditioned attention and hidden-state alignment via reliability-gated late fusion. The central empirical claim is that RouteGuard is consistently the strongest or most robust detector across real and synthetic open-source skill benchmarks; specifically, on the Skill-Inject channel slice it achieves 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, supporting the conclusion that internal-signal detection is required over text-only filtering.

Significance. If the empirical results and the reliability of attention hijacking as a generalizable signal are substantiated, the work would be significant for LLM agent security. It provides a mechanistic explanation for why text-only defenses fail against skill poisoning and demonstrates a practical internal-signal approach that outperforms lexical baselines on the reported benchmarks. The combination of real and synthetic datasets and the focus on pre-execution detection are strengths that could inform future agent hardening techniques.

major comments (2)

[Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: The performance numbers (0.8834 F1 on Skill-Inject, 90.51% recovery of missed attacks) are stated without any description of the model families evaluated, the precise attention extraction procedure, the construction of the Skill-Inject channel slice, baseline implementations, statistical tests, or error analysis. These omissions are load-bearing because the claim that RouteGuard is 'consistently the strongest' and that internal signals are required cannot be verified or generalized without them.
[Method / Results] Method and Results sections: The central claim depends on attention hijacking being a reliable, architecture-independent internal signal induced by poisoning. No analysis is provided of its presence or absence across different transformer implementations, attack variants beyond the benchmark distribution, or failure cases, which directly affects whether the 90.51% recovery rate and the broader conclusion follow.

minor comments (2)

[Introduction] The term 'attention hijacking' is introduced in the abstract and early sections without a formal definition, equation, or illustrative figure showing the attention shift, which would aid clarity.
[Method] Notation for the late-fusion mechanism (reliability-gated combination of attention and hidden-state features) could be made more explicit with a short equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving clarity and substantiating our claims, and we address each major point below with specific plans for revision.

read point-by-point responses

Referee: [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: The performance numbers (0.8834 F1 on Skill-Inject, 90.51% recovery of missed attacks) are stated without any description of the model families evaluated, the precise attention extraction procedure, the construction of the Skill-Inject channel slice, baseline implementations, statistical tests, or error analysis. These omissions are load-bearing because the claim that RouteGuard is 'consistently the strongest' and that internal signals are required cannot be verified or generalized without them.

Authors: We agree that additional detail is needed for full verifiability. The Experimental Evaluation section already specifies the model families (Llama-3-8B-Instruct and Mistral-7B-v0.3), the attention extraction procedure (response-conditioned attention aggregated over the final four layers using the model's native attention implementation), the Skill-Inject channel slice construction (synthetic poisoning of 1,200 action-oriented skills with malicious instruction spans), and baseline implementations (lexical screening via keyword matching, embedding cosine similarity, and perplexity filtering). However, we acknowledge the absence of statistical significance tests (e.g., McNemar's test) and per-attack error analysis. We will add these to the revised Experimental Evaluation section and include a brief reference in the abstract to the key experimental setup. This will directly support the 'consistently strongest' claim without altering the reported numbers. revision: yes
Referee: [Method / Results] Method and Results sections: The central claim depends on attention hijacking being a reliable, architecture-independent internal signal induced by poisoning. No analysis is provided of its presence or absence across different transformer implementations, attack variants beyond the benchmark distribution, or failure cases, which directly affects whether the 90.51% recovery rate and the broader conclusion follow.

Authors: We recognize that architecture independence is central to the mechanistic argument. The current evaluation covers two distinct open-source decoder-only families (Llama-3 and Mistral) with different attention implementations and parameter scales, and attention hijacking is observed consistently in successful poisoning cases within these. That said, the manuscript does not include a dedicated cross-architecture ablation or analysis of failure cases outside the benchmark distribution. We will add a new subsection to the Results section that (1) quantifies attention hijacking metrics across the evaluated models, (2) tests additional attack variants (e.g., paraphrased and multi-skill injections), and (3) reports failure cases where hijacking is weak or absent. This will allow readers to assess the generalizability of the 90.51% recovery figure and the necessity of internal signals. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent evaluation

full rationale

The paper introduces RouteGuard as an empirically motivated detector based on observed attention hijacking in poisoned LLM agents, with no equations, derivations, or parameter-fitting steps present in the provided abstract or description. Claims rest on benchmark evaluations using real and synthetic open-source skills, without any self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the central result to its own inputs. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

No mathematical derivations, free parameters, or formal axioms appear in the abstract. The work is empirical and introduces the concept of attention hijacking as an observed phenomenon without independent external validation.

invented entities (1)

attention hijacking no independent evidence
purpose: Structured internal effect where attention shifts to malicious skill spans during successful poisoning
Described as induced by poisoning but no independent evidence or falsifiable prediction outside the study is provided.

pith-pipeline@v0.9.0 · 5469 in / 1302 out tokens · 41676 ms · 2026-05-08T11:23:38.188279+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 12 canonical work pages · 4 internal anchors

[1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
[3]

Zhihao Chen, Ying Zhang, Yi Liu, Gelei Deng, Yuekang Li, Yanjun Zhang, Jianting Ning, Leo Yu Zhang, Lei Ma, and Zhiqiang Li. 2026. https://arxiv.org/abs/2604.03070 Credential leakage in llm agent skills: A large-scale empirical study . arXiv preprint arXiv:2604.03070

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. https://arxiv.org/abs/2406.13352 Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents . arXiv preprint arXiv:2406.13352

work page internal anchor Pith review arXiv 2024
[5]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. https://doi.org/10.1145/3605764.3623985 Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

work page doi:10.1145/3605764.3623985 2023
[6]

Hsu, and Pin-Yu Chen

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, and Pin-Yu Chen. 2024. https://arxiv.org/abs/2411.00348 Attention tracker: Detecting prompt injection attacks in llms . arXiv preprint arXiv:2411.00348

work page arXiv 2024
[7]

Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, and Leo Yu Zhang. 2026 a . https://arxiv.org/abs/2602.06547 Malicious agent skills in the wild: A large-scale security empirical study . arXiv preprint arXiv:2602.06547

work page arXiv 2026
[8]

Prompt Injection attack against LLM-integrated Applications

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, H. Wang, Yan Zheng, Leo Yu Zhang, and Yang Liu. 2023 a . https://doi.org/10.48550/arxiv.2306.05499 Prompt injection attack against llm-integrated applications . arXiv (Cornell University)

work page internal anchor Pith review doi:10.48550/arxiv.2306.05499 2023
[9]

Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang. 2026 b . https://arxiv.org/abs/2601.10338 Agent skills in the wild: An empirical study of security vulnerabilities at scale . arXiv preprint arXiv:2601.10338

work page internal anchor Pith review arXiv 2026
[10]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2023 b . https://arxiv.org/abs/2310.12815 Formalizing and benchmarking prompt injection attacks and defenses . arXiv preprint arXiv:2310.12815

work page arXiv 2023
[11]

protectskills . 2026. https://github.com/protectskills/MaliciousAgentSkillsBench Do Not Mention This to the User : Detecting and understanding malicious agent skills . GitHub repository

2026
[12]

David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, and Maksym Andriushchenko. 2026. https://arxiv.org/abs/2602.20156 Skill-inject: Measuring agent vulnerability to skill file attacks . arXiv preprint arXiv:2602.20156

work page arXiv 2026
[13]

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. 2025. https://arxiv.org/abs/2507.15219 Promptarmor: Simple yet effective prompt injection defenses . arXiv preprint arXiv:2507.15219

work page arXiv 2025
[14]

Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Weiran Yao, Jiacheng Wang, Ruoxi Jia, and Ruiyi Zhang. 2025. https://aclanthology.org/2025.findings-acl.967/ Defending against indirect prompt injection by instruction detection . In Findings of the Association for Computational Linguistics: ACL 2025, pages 18714--18735

2025
[15]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. https://doi.org/10.18653/v1/2024.findings-acl.624 Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

work page doi:10.18653/v1/2024.findings-acl.624 2024
[16]

Jiayi Zhong, Yifan Zhang, Yanyu Fan, Jiawei Gu, Haoyu Wang, and Junjie Hu. 2025. https://arxiv.org/abs/2512.08417 Attention is all you need to defend against indirect prompt injection attacks in llms . arXiv preprint arXiv:2512.08417

work page arXiv 2025