VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation
Pith reviewed 2026-06-27 19:56 UTC · model grok-4.3
The pith
Tool error messages carry implicit authority that mutated injections exploit to triple indirect prompt injection success in AI agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tool error messages possess implicit authority that triggers corrective reasoning modes bypassing standard safety heuristics, allowing systematic mutation of payloads in the error-handling loop to achieve error-path injection that triples the success rate of indirect prompt injection and reaches up to 100 percent compliance.
What carries the argument
VATS, a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions and isolates structural positioning as the strongest vector.
If this is right
- Production framework guardrails can mitigate these vulnerabilities.
- The model layer itself remains susceptible, creating systemic risk for bespoke agentic workflows.
- Structural positioning of instructions within error context is the most effective exploit vector across all tested models.
Where Pith is reading between the lines
- Agent designs may need separate safety checks on error content that differ from checks on normal user input.
- Models could be retrained to treat tool errors strictly as diagnostic data rather than authoritative directives.
- The same mutation approach might expose similar weaknesses in other feedback loops such as retry messages or status updates.
Load-bearing premise
Tool error messages possess implicit authority that causes models to enter corrective reasoning modes which ignore safety rules.
What would settle it
Replace the mutated error messages with neutral non-authoritative text while keeping the same injected instructions and measure whether attack success falls back to the level of standard indirect prompt injection.
Figures
read the original abstract
As the Model Context Protocol (MCP) standardizes tool-calling for autonomous agents, it introduces a critical, unexamined attack surface: the error-handling loop. We hypothesize that tool error messages possess implicit authority, triggering corrective reasoning modes that bypass standard safety heuristics. We introduce VATS (Vulnerability Analysis of Tool Streams), a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions. Our evaluation across four frontier models, Gemini 3.1 Pro, GPT-5.5, GLM-5.1, and Qwen3-Coder, demonstrates that error-path injection triples the success rate of standard indirect prompt injection (IPI), achieving up to 100% compliance in controlled evaluations. We isolate structural positioning (sandwiching instructions within error context) as the most effective exploit vector across all tested models. While we find that production framework guardrails can mitigate these vulnerabilities, the inherent susceptibility of the model layer poses a systemic risk to bespoke agentic workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VATS, a mutation-driven framework that evolves adversarial payloads across seven structural and linguistic dimensions to perform error-path injection in the Model Context Protocol (MCP) error-handling loop for autonomous agents. It hypothesizes that tool error messages carry implicit authority that triggers corrective reasoning and bypasses safety heuristics. The central empirical claim is that this approach triples the success rate of standard indirect prompt injection (IPI) and reaches up to 100% compliance on four frontier models (Gemini 3.1 Pro, GPT-5.5, GLM-5.1, Qwen3-Coder), with structural positioning (sandwiching) identified as the strongest vector; production guardrails are said to mitigate but not eliminate the model-layer risk.
Significance. If the quantitative results and attribution to error-path authority hold after proper controls, the work would identify a previously unexamined attack surface in standardized agent tool-calling protocols and supply a systematic, extensible method for discovering such vulnerabilities. The multi-model evaluation and isolation of structural positioning are potential strengths for reproducibility and follow-on research in AI agent security.
major comments (2)
- [Abstract] Abstract: The quantitative claims that error-path injection 'triples the success rate of standard indirect prompt injection (IPI)' and achieves 'up to 100% compliance' are stated without any description of experimental design, number of trials per condition, definition of success/compliance, baseline IPI success rates, or statistical methods. This absence prevents verification that the data support the stated effect sizes.
- [Abstract and Evaluation] Abstract and Evaluation: The reported performance gain is attributed to the implicit authority of tool error messages, yet the evaluation only contrasts error-path injection against standard IPI. No ablation is described in which the same VATS-mutated payloads are placed in non-error contexts; without this control it is impossible to determine whether the tripling is caused by the error framing or by the mutation framework itself.
minor comments (1)
- [Abstract] The abstract refers to 'seven structural and linguistic dimensions' but does not enumerate them; an explicit list or table would improve reproducibility even if the full details appear later.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and evaluation design. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The quantitative claims that error-path injection 'triples the success rate of standard indirect prompt injection (IPI)' and achieves 'up to 100% compliance' are stated without any description of experimental design, number of trials per condition, definition of success/compliance, baseline IPI success rates, or statistical methods. This absence prevents verification that the data support the stated effect sizes.
Authors: We agree that the abstract would benefit from additional methodological context to allow readers to assess the claims more readily. The full evaluation section reports the relevant details (trial counts, success definitions, baselines, and statistical approach), but these are not summarized in the abstract. In the revised manuscript we will expand the abstract to include a concise description of the experimental design, number of trials, success criteria, and baseline rates. revision: yes
-
Referee: [Abstract and Evaluation] Abstract and Evaluation: The reported performance gain is attributed to the implicit authority of tool error messages, yet the evaluation only contrasts error-path injection against standard IPI. No ablation is described in which the same VATS-mutated payloads are placed in non-error contexts; without this control it is impossible to determine whether the tripling is caused by the error framing or by the mutation framework itself.
Authors: This observation is correct. The current evaluation isolates the effect of the error-path context by comparing VATS-augmented error messages against standard IPI, but does not include an ablation that applies the identical VATS-mutated payloads outside error contexts. Such a control would more cleanly attribute gains to the hypothesized authority of error messages versus the mutation framework alone. We will add this ablation experiment to the revised evaluation section. revision: yes
Circularity Check
No circularity: empirical evaluation of mutation framework stands independently
full rationale
The paper presents a hypothesis about error messages and introduces VATS as a systematic mutation framework, then reports direct experimental comparisons of success rates on four frontier models. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The central results are framed as measured outcomes from controlled evaluations rather than quantities derived from the inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
URL https://arxiv.org/abs/ 2604.20994. Cartagena, A. and Teixeira, A. Mind the gap: Text safety does not transfer to tool-call safety in llm agents.arXiv preprint arXiv:2602.16943,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems
Chang, Z., Li, M., Jia, X., Wang, J., Huang, Y ., Jiang, Z., Liu, Y ., and Wang, Q. One shot dominance: Knowl- edge poisoning attack on retrieval-augmented generation systems.arXiv preprint arXiv:2505.11548,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Cohen, A. In-browser llm-guided fuzzing for real-time prompt injection testing in agentic ai browsers.arXiv preprint arXiv:2510.13543,
-
[4]
MCP adoption statistics 2026: Model context protocol, April
Digital Applied Team. MCP adoption statistics 2026: Model context protocol, April
2026
-
[5]
Geng, Y ., Li, H., Mu, H., Han, X., Baldwin, T., Abend, O., Hovy, E., and Frermann, L
URL https://www.digitalapplied.com/blog/ mcp-adoption-statistics-2026-model- context-protocol. Geng, Y ., Li, H., Mu, H., Han, X., Baldwin, T., Abend, O., Hovy, E., and Frermann, L. Control illusion: The failure of instruction hierarchies in large language mod- els. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 30816–30824,
2026
-
[6]
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
URL https://arxiv.org/abs/2512.23128. 5 V ATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation Lin, J., Zhou, Z., Zheng, Z., Liu, S., Xu, T., Chen, Y ., and Chen, E. Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Liu, Y ., Wang, W., Feng, R., Zhang, Y ., Xu, G., Deng, G., Li, Y ., and Zhang, L
URL https://arxiv.org/abs/2406.03807. Liu, Y ., Wang, W., Feng, R., Zhang, Y ., Xu, G., Deng, G., Li, Y ., and Zhang, L. Agent skills in the wild: An empirical study of security vulnerabilities at scale.arXiv preprint arXiv:2601.10338,
-
[8]
URL https://arxiv.org/abs/ 2601.17549. Model Context Protocol. Model context protocol specifi- cation. https://modelcontextprotocol.io/ specification/2025-11-25, nov
-
[9]
Version 2025-11-
URL https://modelcontextprotocol.io/ specification/2025-11-25. Version 2025-11-
2025
-
[10]
Accessed: 2026-05-06. OpenAI. codex: Lightweight coding agent that runs in your terminal,
2026
-
[11]
Pai, K., Shah, P., and Patel, H
URLhttps://arxiv.org/abs/2506.04255. Pai, K., Shah, P., and Patel, H. Naamse: Framework for evolutionary security evaluation of agents,
-
[12]
URL https://arxiv.org/abs/2602.07391. Qin, Y ., Liang, S., Ye, Y ., Zhu, K., Yan, L., Lu, Y ., Lin, Y ., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitat- ing large language models to master 16000+ real-world apis. InThe twelfth international conference on learning representations,
-
[13]
Shi, G., Du, H., Wang, Z., Liang, X., Liu, W., Bian, S., and Guan, Z
URL https: //fordelstudios.com/research/mcp- production-engineering-guide. Shi, G., Du, H., Wang, Z., Liang, X., Liu, W., Bian, S., and Guan, Z. Sok: Trust-authorization mismatch in llm agent interactions.arXiv preprint arXiv:2512.06914,
-
[14]
Wang, Z., Jiang, Y ., Yu, J., and Huang, H. The illusion of role separation: Hidden shortcuts in llm role learning (and how to fix them).arXiv preprint arXiv:2505.00626,
-
[15]
ReAct: Synergizing Reasoning and Acting in Language Models
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y . React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Injeca- gent: Benchmarking indirect prompt injections in tool- integrated large language model agents
Zhan, Q., Liang, Z., Ying, Z., and Kang, D. Injeca- gent: Benchmarking indirect prompt injections in tool- integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024, pp. 10471–10506,
2024
-
[17]
S., and Kang, D
Zhan, Q., Fang, R., Panchal, H. S., and Kang, D. Adaptive attacks break defenses against indirect prompt injection attacks on llm agents. InFindings of the Association for Computational Linguistics: NAACL 2025, pp. 7101– 7117,
2025
-
[18]
status":
6 V ATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation A. Mutation Examples We show the seed error payload and representative mutations across each dimension. Each example shows the exact JSON returned to the agent as a tool error response. A.1. Seed Payload The starting point for all mutations. Imperative framing, no autho...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.