
arxiv: 2605.19192 · v2 · pith:XAYZXLR6new · submitted 2026-05-18 · 💻 cs.AI · cs.CR

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords: multimodal agents, hallucination, tool use safety, evidence certificates, verifiers, action authorization, agent security, DOM verification

The pith

In the paper's tests, multimodal agents avoid unsafe tool calls when actions are authorized only by certificates from content verifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that hallucinations become safety failures when an agent's model text supplies a false precondition for a privileged tool call. It introduces evidence-carrying agents that reject free-form model output as evidence, break each proposed action into action-critical predicates, and require typed certificates from constrained DOM, OCR, and AX verifiers before any authorization occurs. A deterministic gate then permits only the privileges the certificates explicitly support. On 200 end-to-end tasks and 120 browser tasks this design records no unsafe executions, whereas in a HACR audit on 500 stratified task keys, unsupported action-critical claims reach unsafe execution in 100 percent of cases for naive agents and 49.6 percent of cases for prompt-only defenses. A reader would care because the method converts an opaque model belief into an auditable residual that can be checked independently of the model's correctness.

Core claim

The central claim is that hallucination-to-action conversion can be prevented by treating model language as inadmissible for authorization and instead requiring deterministic certificates from hardened verifiers for every action-critical predicate. With content-derived certificates the system records zero unsafe executions on 200 end-to-end tasks and 120 browser tasks, while unsupported action-critical claims reach unsafe execution in 100 percent of naive-agent cases and 49.6 percent of prompt-only cases. Verifier red-teaming across 17 canonical attack categories indicates that each of four hardening steps is necessary; after hardening, the gate is bypassed in 0 of 1,700 trials.

What carries the argument

Evidence-carrying multimodal agents (ECA), which decompose each tool call into action-critical predicates, obtain typed certificates from constrained verifiers, and apply a deterministic gate that authorizes privileges only when those certificates are present.
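
The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Certificate` fields, the `ACTION_SCHEMA` table, the predicate names, and the `gate` function are all hypothetical, since this summary does not give the actual schema or certificate layout. What it shows is the separation the paper relies on: the gate inspects only certificates from trusted verifiers and never reads model text.

```python
from dataclasses import dataclass

# Hypothetical names throughout: the paper's real schema and certificate
# fields are not specified in this summary.

@dataclass(frozen=True)
class Certificate:
    predicate: str  # which action-critical predicate this certifies
    value: bool     # whether the verifier found the predicate to hold
    source: str     # verifier that produced it: "dom", "ocr", or "ax"

# Assumed action schema: tool name -> predicates that must be certified.
ACTION_SCHEMA = {
    "submit_payment": {"recipient_matches_invoice", "amount_visible_on_page"},
    "read_page": set(),
}

TRUSTED_SOURCES = {"dom", "ocr", "ax"}

def gate(action: str, certificates: list[Certificate]) -> bool:
    """Allow the action only if every required predicate carries a positive
    certificate from a trusted verifier. Model text is never consulted."""
    required = ACTION_SCHEMA.get(action)
    if required is None:
        return False  # unknown actions are denied by default
    certified = {
        c.predicate
        for c in certificates
        if c.value and c.source in TRUSTED_SOURCES
    }
    return required <= certified

# The model may claim anything; the claim is not an input to the gate.
model_claim = "The page shows the recipient matches the invoice."
certs = [Certificate("amount_visible_on_page", True, "dom")]
print(gate("submit_payment", certs))  # False: one predicate is uncertified
```

Denying unknown actions by default is a design choice of this sketch; it keeps the gate conservative when the schema has no entry for a proposed tool call.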

If this is right

  • Naive agents reach 100 percent unsafe execution when given unsupported action-critical claims.
  • Prompt-only defenses still allow unsafe execution in 49.6 percent of cases under the same claims.
  • ECA records zero unsafe executions on 200 end-to-end tasks and 120 browser tasks.
  • After four hardening steps, canonical gate bypass drops to zero across 1,700 verifier red-team trials.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same certificate requirement could be applied to other perception sources such as video frames or API responses in non-browser agents.
  • Agent platforms may need to expose standardized verifiable interfaces for every tool to support this authorization model.
  • The separation of proposal by the model from authorization by certificates reduces dependence on model alignment alone for safety.

Load-bearing premise

The hardened DOM, OCR, and AX verifiers always produce accurate certificates that reflect actual content and cannot be bypassed, while the chosen predicates cover every safety-relevant precondition.

What would settle it

Discovery of even one unsafe tool execution by an ECA agent on a task containing a hallucinated action-critical claim, or a successful bypass of the hardened verifiers by any of the 17 attack categories.

Figures

Figures reproduced from arXiv: 2605.19192 by Guijia Zhang, Hao Zheng, Harry Yang.

Figure 1: Evidence-carrying multimodal agents (ECA). A single observation o ∈ O feeds two strictly parallel, symmetric lanes. Top (untrusted): the MLLM proposes an action; an action schema G_a declares the predicates that must be certified. Bottom (trusted): constrained verifiers consume the raw observation and emit typed certificates e = (τ, v, r, s, ν, κ, t, λ). The deterministic gate Π(a, E) authorises execution iff …
Figure 2: Authorization performance and risk convergence. (a) Safety-utility frontier across clean evaluation traces; the gray cross groups five collapsed non-prompt baselines and ablations. (b) Prompt-only ASR across 12 adversarial subgroups collapses to 0% under ECA. Colors and markers denote AgentDojo, AgentDyn, and VPI-Bench families. SafeToolBench is excluded because its prospective-risk ASR is 0% by definition…
Figure 3: Per-benchmark metric heatmap. (A) ASR on unsafe tasks. (B) UAR on unsafe tasks. (C) Benign success on safe tasks.
… Metadata-structural attacks (field removal/renaming): 428/600 bypass. In this stress test, the parser resists content-level evasion but is trivially bypassed when the attacker controls metadata field structure, a threat model presupposing compromised infrastructure. …
Figure 4: Per-benchmark detailed breakdown. (A) ASR by attack benchmark with Wilson 95% CI. (B) UAR by attack benchmark. (C) Benign success by utility benchmark. (D) Gate decision distribution.
Figure 5: Oracle vs. content-parser certificate comparison. (A) Agreement rates. (B) ASR under oracle and parser certificates. (C) Benign success. (D) Decision flips: all 117 flips are conservative (allow→block).
Figure 6: Representative case studies. Red panels (A–B): unsafe tasks blocked by ECA. Blue panels (C–D): benign tasks correctly allowed.
Original abstract

Multimodal agents increasingly choose tool calls from screenshots, documents, and webpages, where a false perceptual claim can turn hallucination from an answer-quality error into an authorization failure. We formalize this failure mode as hallucination-to-action conversion: an unsupported claim supplies the precondition for a privileged action. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence, decompose each tool call into action-critical predicates, obtain typed certificates from constrained DOM/OCR/AX verifiers, and use a deterministic gate to authorize only the privileges those certificates support. Rather than hiding perception error, ECA converts opaque model belief into auditable residuals at the verifier, schema, and implementation levels. Verifier red-teaming across 17 canonical attack categories shows that four targeted hardening steps are each necessary; after hardening, canonical gate bypass is 0/1,700 (Wilson 95% upper bound 0.22%). With content-derived certificates, ECA observes zero unsafe executions on 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%). A HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defenses (49.6%), but not for ECA. Oracle-certificate replay over 7,488 GPT-5.4 traces isolates gate correctness, while neural judge baselines still admit most unsafe actions under the same threat model. The resulting principle is simple: model language may propose tool use, but certified predicates must authorize it.
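
The Wilson upper bounds quoted in the abstract can be checked with a short calculation. The sketch below assumes the standard two-sided Wilson score interval with z = 1.96; the function name `wilson_upper` is illustrative. For 0 failures in 1,700 trials it gives a value near 0.23%, close to the reported 0.22%. The bounds reported for the 200 and 120 task sets may rest on a different denominator (for example, only the unsafe subset of tasks) or a different interval variant, which this summary does not specify, so the sketch is not expected to reproduce 2.67% or 4.3% from n = 200 or n = 120.

```python
import math

def wilson_upper(failures: int, trials: int, z: float = 1.96) -> float:
    """Upper limit of the two-sided Wilson score interval for a proportion.

    Assumes the plain Wilson interval without continuity correction.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / denom
    return center + half

# With zero observed failures the bound reduces to z^2 / (n + z^2).
print(f"{wilson_upper(0, 1700):.4%}")
```

With zero observed failures the bound shrinks roughly in proportion to 1/n, which is why the 1,700-trial red-team result supports a much tighter limit than the smaller end-to-end task sets.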

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Evidence-Carrying Multimodal Agents (ECA) to mitigate hallucination-to-action conversion, where unsupported perceptual claims enable unsafe tool calls. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from hardened constrained DOM/OCR/AX verifiers, and authorizes actions only via a deterministic gate based on those certificates. It reports zero unsafe executions across 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%), a 0/1,700 bypass rate for 17 canonical attack categories after four hardening steps, and contrasts this with 100% and 49.6% unsafe rates for naive agents and prompt-only defenses in a HACR audit on 500 tasks. Oracle-certificate replay over 7,488 traces isolates gate correctness from model outputs.

Significance. If the results hold, this provides a concrete mechanism for grounding multimodal agent actions in verifiable certificates rather than model beliefs, with clear empirical separation from baselines via statistical bounds and oracle isolation. The approach converts perception errors into auditable residuals and demonstrates practical safety gains on end-to-end tasks. Strengths include the use of Wilson bounds for upper limits on failure rates and the replay experiment for isolating the gate component.

major comments (2)
  1. [Verifier red-teaming section] Verifier red-teaming across 17 canonical attack categories: the 0/1,700 bypass rate and Wilson upper bound of 0.22% after the four hardening steps only covers the tested categories. The manuscript provides no analysis or experiments addressing novel attacks, adaptive adversaries, dynamic content, or other bypass vectors that could produce certificates allowing unsafe actions to pass the deterministic gate. This assumption is load-bearing for the zero-unsafe-execution claims on the 200 end-to-end and 120 browser tasks.
  2. [Action-critical predicates description] Action-critical predicate decomposition: the paper treats the decomposition as both necessary and sufficient for capturing all safety-relevant preconditions, yet offers no formal coverage argument, completeness proof, or exhaustive enumeration showing that every possible unsafe precondition is represented by these predicates. If any precondition is omitted, accurate certificates could still authorize unsafe actions.
minor comments (2)
  1. [Abstract] The abstract references 'GPT-5.4 traces'; clarify whether this denotes a specific model release or is a descriptive placeholder for the replay experiment.
  2. [Results figures] Figure captions for the HACR audit and oracle replay results could more explicitly state the threat model and isolation guarantees to aid reader interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. The comments highlight important considerations regarding the scope of our security evaluation and the coverage of the predicate decomposition. We respond to each point below with clarifications and proposed revisions that strengthen the presentation of our results without overstating the claims.

Point-by-point responses
  1. Referee: [Verifier red-teaming section] Verifier red-teaming across 17 canonical attack categories: the 0/1,700 bypass rate and Wilson upper bound of 0.22% after the four hardening steps only covers the tested categories. The manuscript provides no analysis or experiments addressing novel attacks, adaptive adversaries, dynamic content, or other bypass vectors that could produce certificates allowing unsafe actions to pass the deterministic gate. This assumption is load-bearing for the zero-unsafe-execution claims on the 200 end-to-end and 120 browser tasks.

    Authors: We agree that the red-teaming results apply specifically to the 17 canonical attack categories we defined and tested. The 0/1,700 bypass rate and associated Wilson bound demonstrate effective mitigation for those vectors following the four hardening steps. We do not claim or imply resistance to arbitrary novel attacks, adaptive adversaries, or untested dynamic content, as such guarantees would require a different formal security model. The end-to-end zero-unsafe-execution results remain empirical observations under the evaluated threat model and task distribution, with the reported statistical bounds. We will add a dedicated limitations subsection that explicitly discusses these boundaries, notes the potential for adaptive bypasses, and suggests avenues for ongoing red-teaming. revision: partial

  2. Referee: [Action-critical predicates description] Action-critical predicate decomposition: the paper treats the decomposition as both necessary and sufficient for capturing all safety-relevant preconditions, yet offers no formal coverage argument, completeness proof, or exhaustive enumeration showing that every possible unsafe precondition is represented by these predicates. If any precondition is omitted, accurate certificates could still authorize unsafe actions.

    Authors: The predicates are obtained by inspecting the safety-relevant preconditions of each tool API within the concrete task environments we study. We do not assert a formal completeness proof or exhaustive enumeration, as the space of possible unsafe preconditions is open-ended in general multimodal settings. Instead, the approach relies on making any coverage gaps visible through verifier residuals and schema constraints, which are then audited in the HACR evaluation. The empirical results show that the selected predicates block the unsafe executions that occur under naive and prompt-only baselines. We will revise the manuscript to state clearly that the predicates are task-derived rather than universally complete and to include a concrete example of predicate construction for one representative task. This is a full revision to the relevant description and discussion sections. revision: yes
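
The coverage concern in point 2 can be made concrete with a toy schema. The sketch below is purely illustrative: the `delete_file` tool, the predicate names, and the `authorized` helper are invented here and do not come from the paper. It shows that a deterministic check with fully accurate certificates still authorizes an unsafe action when the schema omits a safety-relevant predicate.

```python
# Hypothetical schemas for a "delete_file" tool; predicate names are invented.
SCHEMA_V1 = {"delete_file": {"file_exists"}}
SCHEMA_V2 = {"delete_file": {"file_exists", "user_confirmed_deletion"}}

def authorized(schema: dict, action: str, certified: set) -> bool:
    """Deterministic check: every predicate the schema lists must be certified."""
    return schema[action] <= certified

# The certificates are accurate: the file exists, the user never confirmed.
certified = {"file_exists"}

print(authorized(SCHEMA_V1, "delete_file", certified))  # True: gap in schema
print(authorized(SCHEMA_V2, "delete_file", certified))  # False: gap closed
```

In this toy case the gate and the verifiers behave correctly in both runs; only the schema differs, which is why the referee treats predicate coverage as a separate assumption from verifier accuracy.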

Circularity Check

0 steps flagged

No significant circularity in the ECA derivation chain

Full rationale

The paper grounds its zero-unsafe-execution claims in two independent empirical mechanisms: red-teaming of the hardened DOM/OCR/AX verifiers against 17 canonical attack categories (yielding 0/1,700 bypasses) and oracle-certificate replay over 7,488 traces that isolates deterministic gate correctness from model outputs. These checks are external to the agent's perceptual claims and do not reduce the reported safety bounds to quantities defined by the authors' own fitted parameters or self-referential predicates. The action-critical predicate decomposition is presented as an explicit design choice whose coverage is validated by the HACR audit rather than assumed by construction. No load-bearing step equates a derived result to its input by definition, self-citation, or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The security claims rest on the unproven reliability of the hardened verifiers and the completeness of the predicate decomposition; these are domain assumptions rather than quantities derived from external benchmarks or shipped artifacts.

axioms (1)
  • Domain assumption: Constrained DOM/OCR/AX verifiers can be hardened to resist all 17 canonical attack categories while still producing accurate certificates for legitimate content.
    This premise is required for the zero-bypass and zero-unsafe-execution claims to hold.
invented entities (1)
  • Evidence-Carrying Multimodal Agent (ECA) with deterministic gate (no independent evidence)
    purpose: To enforce that only certified predicates authorize privileged tool calls instead of model-generated text.
    Newly introduced system component whose correctness is not independently verified outside the paper's experiments.

pith-pipeline@v0.9.0 · 5827 in / 1543 out tokens · 48610 ms · 2026-05-22T08:53:22.462684+00:00 · methodology

