
arxiv: 2606.07992 · v1 · pith:WJ2U2UEBnew · submitted 2026-06-06 · 💻 cs.AI · cs.CR · cs.SE

VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation

Pith reviewed 2026-06-27 19:56 UTC · model grok-4.3

classification 💻 cs.AI · cs.CR · cs.SE
keywords prompt injection, tool calling, AI agents, error handling, vulnerability analysis, model context protocol

The pith

Tool error messages carry implicit authority that mutated injections exploit to triple indirect prompt injection success in AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the idea that error messages returned during tool calls trigger corrective reasoning in models, causing them to overlook safety rules. It builds a framework that mutates attack payloads along structural and linguistic axes and inserts the resulting instructions inside those error responses. Tests on four frontier models show that the approach triples the success rate of ordinary indirect prompt injection and can reach full compliance. Structural placement of the instructions inside the error context proves the strongest single factor, while some production guardrails reduce but do not eliminate the exposure.

Core claim

Tool error messages possess implicit authority that triggers corrective reasoning modes bypassing standard safety heuristics, allowing systematic mutation of payloads in the error-handling loop to achieve error-path injection that triples the success rate of indirect prompt injection and reaches up to 100 percent compliance.

What carries the argument

VATS, a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions and isolates structural positioning as the strongest vector.

If this is right

  • Production framework guardrails can mitigate these vulnerabilities.
  • The model layer itself remains susceptible, creating systemic risk for bespoke agentic workflows.
  • Structural positioning of instructions within error context is the most effective exploit vector across all tested models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs may need separate safety checks on error content that differ from checks on normal user input.
  • Models could be retrained to treat tool errors strictly as diagnostic data rather than authoritative directives.
  • The same mutation approach might expose similar weaknesses in other feedback loops such as retry messages or status updates.
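The first extension above, a separate check on error content, can be sketched as follows. This is a minimal illustration, not a method from the paper: the function names, the `ScreenedError` type, the wrapper label, and the regular-expression patterns are all hypothetical. Keyword heuristics like these are easy to evade with exactly the kind of mutation the paper describes, so the sketch shows where such a check would sit rather than how to make one robust.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns (an assumption for illustration): phrases that read
# like directives to the agent rather than diagnostics about the failure.
_DIRECTIVE_PATTERNS = (
    r"\b(you must|you should|please|now)\b.*\b(send|email|upload|post|run|execute|call|invoke)\b",
    r"\bignore (all|any|previous|prior)\b",
    r"\bto (fix|resolve|recover)\b.*\b(send|email|upload|run|execute|call)\b",
)


@dataclass(frozen=True)
class ScreenedError:
    text: str                 # the raw error message returned by the tool
    suspicious: bool          # True if any directive-like pattern matched
    matched: tuple[str, ...]  # the patterns that matched, for logging


def screen_tool_error(message: str) -> ScreenedError:
    """Flag tool error text that contains instruction-like content."""
    hits = tuple(
        p for p in _DIRECTIVE_PATTERNS if re.search(p, message, re.IGNORECASE)
    )
    return ScreenedError(text=message, suspicious=bool(hits), matched=hits)


def render_for_model(err: ScreenedError) -> str:
    """Wrap the error so it is presented as data, not as an instruction."""
    header = "UNTRUSTED TOOL ERROR (diagnostic data only)"
    if err.suspicious:
        # Withhold the body and surface only the fact that it was flagged.
        return f"{header}: content withheld, instruction-like text detected."
    return f"{header}: {err.text}"
```

A caller would pass every tool error through `screen_tool_error` before it reaches the model context, independently of whatever filtering is applied to user input.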

Load-bearing premise

Tool error messages possess implicit authority that causes models to enter corrective reasoning modes which ignore safety rules.

What would settle it

Replace the mutated error messages with neutral non-authoritative text while keeping the same injected instructions and measure whether attack success falls back to the level of standard indirect prompt injection.
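One way to decide whether success "falls back" in such a comparison is a one-sided Fisher exact test on the two conditions. The sketch below is an assumption about how the comparison could be scored, not the paper's analysis; the function name and the counts in the usage note are placeholders, not reported results.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for a 2x2 table.

    Rows are conditions, columns are outcomes:
        authoritative error text: a successes, b failures
        neutral text:             c successes, d failures
    Returns the probability, under the null of equal success rates, of
    observing at least `a` successes in the first row given fixed margins.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    if total == 0:
        raise ValueError("table is empty")
    denom = comb(total, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / denom
```

With placeholder counts of 9 successes in 10 trials under authoritative framing and 3 in 10 under neutral framing, `fisher_one_sided(9, 1, 3, 7)` is about 0.0099, which would count against the two conditions sharing one success rate.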

Figures

Figures reproduced from arXiv: 2606.07992 by Harshil Patel, Kunal Pai.

Figure 1
Gemini CLI (v0.41.2) validation trace. The framework explicitly flags the error-path injection as an exfiltration attempt and bypasses the payload to use native grep for task completion.
All maintained identical conversation context, user tasks, assistant tool calls, and payload positioning, isolating the structured error JSON as the single independent variable. Each paired configuration was sampled 3 time…
Figure 2
OpenAI Codex (v0.129.0) validation trace. The framework-level constraints in AGENTS.md successfully suppress the model’s impulse to follow the tool-injected recovery instructions.
GLM-5.1 Trace (Error Condition): “The error message suggests sending a workspace file listing... Let me follow those instructions.” (GLM-5.1, error condition → send email invoked)
This procedural framing appeared in 53% of the err…

As the Model Context Protocol (MCP) standardizes tool-calling for autonomous agents, it introduces a critical, unexamined attack surface: the error-handling loop. We hypothesize that tool error messages possess implicit authority, triggering corrective reasoning modes that bypass standard safety heuristics. We introduce VATS (Vulnerability Analysis of Tool Streams), a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions. Our evaluation across four frontier models, Gemini 3.1 Pro, GPT-5.5, GLM-5.1, and Qwen3-Coder, demonstrates that error-path injection triples the success rate of standard indirect prompt injection (IPI), achieving up to 100% compliance in controlled evaluations. We isolate structural positioning (sandwiching instructions within error context) as the most effective exploit vector across all tested models. While we find that production framework guardrails can mitigate these vulnerabilities, the inherent susceptibility of the model layer poses a systemic risk to bespoke agentic workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VATS, a mutation-driven framework that evolves adversarial payloads across seven structural and linguistic dimensions to perform error-path injection in the Model Context Protocol (MCP) error-handling loop for autonomous agents. It hypothesizes that tool error messages carry implicit authority that triggers corrective reasoning and bypasses safety heuristics. The central empirical claim is that this approach triples the success rate of standard indirect prompt injection (IPI) and reaches up to 100% compliance on four frontier models (Gemini 3.1 Pro, GPT-5.5, GLM-5.1, Qwen3-Coder), with structural positioning (sandwiching) identified as the strongest vector; production guardrails are said to mitigate but not eliminate the model-layer risk.

Significance. If the quantitative results and attribution to error-path authority hold after proper controls, the work would identify a previously unexamined attack surface in standardized agent tool-calling protocols and supply a systematic, extensible method for discovering such vulnerabilities. The multi-model evaluation and isolation of structural positioning are potential strengths for reproducibility and follow-on research in AI agent security.

major comments (2)
  1. [Abstract] The quantitative claims that error-path injection 'triples the success rate of standard indirect prompt injection (IPI)' and achieves 'up to 100% compliance' are stated without any description of experimental design, number of trials per condition, definition of success/compliance, baseline IPI success rates, or statistical methods. This absence prevents verification that the data support the stated effect sizes.
  2. [Abstract and Evaluation] The reported performance gain is attributed to the implicit authority of tool error messages, yet the evaluation only contrasts error-path injection against standard IPI. No ablation is described in which the same VATS-mutated payloads are placed in non-error contexts; without this control it is impossible to determine whether the tripling is caused by the error framing or by the mutation framework itself.
minor comments (1)
  1. [Abstract] The abstract refers to 'seven structural and linguistic dimensions' but does not enumerate them; an explicit list or table would improve reproducibility even if the full details appear later.
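The first major comment turns on trial counts and statistical method. As a rough illustration of why these matter, the sketch below computes a Wilson score interval for a success rate; the function name is hypothetical and the choice of interval is an assumption, since the review does not say which method the paper uses.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

For a hypothetical condition with 3 successes in 3 trials, `wilson_interval(3, 3)` gives a lower bound of roughly 0.44, so an observed 100% compliance rate at very small sample sizes remains compatible with a much lower true rate.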

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation design. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

  1. Referee: [Abstract] The quantitative claims that error-path injection 'triples the success rate of standard indirect prompt injection (IPI)' and achieves 'up to 100% compliance' are stated without any description of experimental design, number of trials per condition, definition of success/compliance, baseline IPI success rates, or statistical methods. This absence prevents verification that the data support the stated effect sizes.

    Authors: We agree that the abstract would benefit from additional methodological context to allow readers to assess the claims more readily. The full evaluation section reports the relevant details (trial counts, success definitions, baselines, and statistical approach), but these are not summarized in the abstract. In the revised manuscript we will expand the abstract to include a concise description of the experimental design, number of trials, success criteria, and baseline rates. revision: yes

  2. Referee: [Abstract and Evaluation] The reported performance gain is attributed to the implicit authority of tool error messages, yet the evaluation only contrasts error-path injection against standard IPI. No ablation is described in which the same VATS-mutated payloads are placed in non-error contexts; without this control it is impossible to determine whether the tripling is caused by the error framing or by the mutation framework itself.

    Authors: This observation is correct. The current evaluation isolates the effect of the error-path context by comparing VATS-augmented error messages against standard IPI, but does not include an ablation that applies the identical VATS-mutated payloads outside error contexts. Such a control would more cleanly attribute gains to the hypothesized authority of error messages versus the mutation framework alone. We will add this ablation experiment to the revised evaluation section. revision: yes
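The ablation discussed in this exchange amounts to a 2x2 design: mutated versus unmutated payloads, crossed with error versus non-error context. A minimal sketch of how the four cells could be summarized follows. The `Cell` type, the function name, and the dictionary keys are hypothetical, and the counts in the usage note are placeholders rather than data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    successes: int
    trials: int

    @property
    def rate(self) -> float:
        return self.successes / self.trials


def factorial_effects(cells: dict[tuple[bool, bool], Cell]) -> dict[str, float]:
    """Summarize a 2x2 ablation keyed by (mutated, error_context).

    Returns the average effect of error framing, the average effect of
    mutation, and their interaction, all as differences in success rate.
    """
    r = {key: cell.rate for key, cell in cells.items()}
    # Effect of error framing, measured separately with and without mutation.
    error_given_mutated = r[(True, True)] - r[(True, False)]
    error_given_plain = r[(False, True)] - r[(False, False)]
    # Effect of mutation, measured separately inside and outside error context.
    mutation_given_error = r[(True, True)] - r[(False, True)]
    mutation_given_plain = r[(True, False)] - r[(False, False)]
    return {
        "error_effect": (error_given_mutated + error_given_plain) / 2,
        "mutation_effect": (mutation_given_error + mutation_given_plain) / 2,
        "interaction": error_given_mutated - error_given_plain,
    }
```

With placeholder cells of 2/10 (plain, non-error), 4/10 (plain, error), 3/10 (mutated, non-error), and 9/10 (mutated, error), the error effect is 0.4, the mutation effect is 0.3, and the interaction is 0.4. A large error effect that persists without mutation would support the implicit-authority hypothesis; a gain that appears only in the mutated cells would point to the mutation framework instead.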

Circularity Check

0 steps flagged

No circularity: empirical evaluation of mutation framework stands independently


The paper presents a hypothesis about error messages and introduces VATS as a systematic mutation framework, then reports direct experimental comparisons of success rates on four frontier models. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The central results are framed as measured outcomes from controlled evaluations rather than quantities derived from the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or background assumptions that can be extracted; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5700 in / 1181 out tokens · 20840 ms · 2026-06-27T19:56:03.564606+00:00 · methodology

