Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3
The pith
In the reported evaluations, multimodal agents avoid unsafe tool calls by authorizing actions only when content verifiers supply certificates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that hallucination-to-action conversion can be prevented by treating model language as inadmissible for authorization and instead requiring deterministic certificates from hardened verifiers for every action-critical predicate. With content-derived certificates, the system records zero unsafe executions on 200 end-to-end tasks and 120 browser tasks, whereas the same unsupported claims lead to unsafe execution in 100 percent of naive-agent cases and 49.6 percent of prompt-only cases. Verifier red-teaming across 17 attack categories indicates that four hardening steps are each necessary to reach zero bypasses in 1,700 trials.
What carries the argument
Evidence-carrying multimodal agents (ECA), which decompose each tool call into action-critical predicates, obtain typed certificates from constrained verifiers, and apply a deterministic gate that authorizes privileges only when those certificates are present.
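The gate described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Certificate`, `ToolCall`, `TRUSTED_VERIFIERS`, and `gate`, as well as the example predicate strings, are hypothetical and chosen only to show the idea that authorization depends on certificates alone and never on free-form model text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Hypothetical verifier labels, taken from the DOM/OCR/AX verifiers named in the paper.
TRUSTED_VERIFIERS = frozenset({"DOM", "OCR", "AX"})


@dataclass(frozen=True)
class Certificate:
    """A typed statement from a verifier about one action-critical predicate."""
    predicate: str  # hypothetical predicate name, e.g. "recipient_matches"
    verifier: str   # which verifier produced the certificate
    holds: bool     # whether the verifier found the predicate to hold


@dataclass(frozen=True)
class ToolCall:
    """A proposed tool call and the predicates that must be certified first."""
    name: str
    required_predicates: FrozenSet[str]


def gate(call: ToolCall, certificates: Iterable[Certificate]) -> bool:
    """Authorize only if every required predicate has a positive certificate
    from a trusted verifier. Model text is deliberately not an input."""
    certified = {
        c.predicate
        for c in certificates
        if c.holds and c.verifier in TRUSTED_VERIFIERS
    }
    return call.required_predicates <= certified


call = ToolCall("submit_payment", frozenset({"amount_visible", "recipient_matches"}))
certs = [
    Certificate("amount_visible", "DOM", True),
    Certificate("recipient_matches", "OCR", True),
]
print(gate(call, certs))      # all required predicates certified
print(gate(call, certs[:1]))  # one certificate missing
```

The function is pure and takes no model output, so the same certificates always yield the same decision. That determinism is the property the sketch is meant to illustrate; how real certificates are produced and hardened is outside its scope.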
If this is right
- Naive agents reach 100 percent unsafe execution when given unsupported action-critical claims.
- Prompt-only defenses still allow unsafe execution in 49.6 percent of cases under the same claims.
- ECA records zero unsafe executions on 200 end-to-end tasks and 120 browser tasks.
- After four hardening steps, canonical gate bypass drops to zero across 1,700 verifier red-team trials.
Where Pith is reading between the lines
- The same certificate requirement could be applied to other perception sources such as video frames or API responses in non-browser agents.
- Agent platforms may need to expose standardized verifiable interfaces for every tool to support this authorization model.
- The separation of proposal by the model from authorization by certificates reduces dependence on model alignment alone for safety.
Load-bearing premise
The hardened DOM, OCR, and AX verifiers always produce accurate certificates that reflect actual content and cannot be bypassed, while the chosen predicates cover every safety-relevant precondition.
What would settle it
Discovery of even one unsafe tool execution by an ECA agent on a task containing a hallucinated action-critical claim, or a successful bypass of the hardened verifiers by any of the 17 attack categories.
Original abstract
Multimodal agents increasingly choose tool calls from screenshots, documents, and webpages, where a false perceptual claim can turn hallucination from an answer-quality error into an authorization failure. We formalize this failure mode as hallucination-to-action conversion: an unsupported claim supplies the precondition for a privileged action. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence, decompose each tool call into action-critical predicates, obtain typed certificates from constrained DOM/OCR/AX verifiers, and use a deterministic gate to authorize only the privileges those certificates support. Rather than hiding perception error, ECA converts opaque model belief into auditable residuals at the verifier, schema, and implementation levels. Verifier red-teaming across 17 canonical attack categories shows that four targeted hardening steps are each necessary; after hardening, canonical gate bypass is 0/1,700 (Wilson 95% upper bound 0.22%). With content-derived certificates, ECA observes zero unsafe executions on 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%). A HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defenses (49.6%), but not for ECA. Oracle-certificate replay over 7,488 GPT-5.4 traces isolates gate correctness, while neural judge baselines still admit most unsafe actions under the same threat model. The resulting principle is simple: model language may propose tool use, but certified predicates must authorize it.
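The abstract reports Wilson 95% upper bounds alongside its zero-failure counts. As a hedged illustration of how such a bound is computed, the sketch below implements the standard (uncorrected) Wilson score upper limit. The function name `wilson_upper_bound` is hypothetical, and the review does not say which Wilson variant or denominators the authors used, so the sketch is not guaranteed to reproduce the quoted figures.

```python
import math


def wilson_upper_bound(failures: int, trials: int, z: float = 1.96) -> float:
    """Upper limit of the standard Wilson score interval for a binomial rate.

    z = 1.96 corresponds to a two-sided 95% interval (an assumption here).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must lie in [0, trials]")
    p_hat = failures / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = p_hat + z2 / (2.0 * trials)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials))
    return (centre + margin) / denom


# Trial counts quoted in the abstract, each with zero observed failures.
for n in (1700, 200, 120):
    print(n, f"{100 * wilson_upper_bound(0, n):.2f}%")
```

With zero observed failures the expression reduces to z^2 / (n + z^2), so the bound tightens only as the number of trials grows. A clean record on a few hundred tasks therefore still leaves a nonzero upper bound on the failure rate.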
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Evidence-Carrying Multimodal Agents (ECA) to mitigate hallucination-to-action conversion, where unsupported perceptual claims enable unsafe tool calls. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from hardened constrained DOM/OCR/AX verifiers, and authorizes actions only via a deterministic gate based on those certificates. It reports zero unsafe executions across 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%), a 0/1,700 bypass rate for 17 canonical attack categories after four hardening steps, and contrasts this with 100% and 49.6% unsafe rates for naive agents and prompt-only defenses in a HACR audit on 500 tasks. Oracle-certificate replay over 7,488 traces isolates gate correctness from model outputs.
Significance. If the results hold, this provides a concrete mechanism for grounding multimodal agent actions in verifiable certificates rather than model beliefs, with clear empirical separation from baselines via statistical bounds and oracle isolation. The approach converts perception errors into auditable residuals and demonstrates practical safety gains on end-to-end tasks. Strengths include the use of Wilson bounds for upper limits on failure rates and the replay experiment for isolating the gate component.
major comments (2)
- [Verifier red-teaming section] Verifier red-teaming across 17 canonical attack categories: the 0/1,700 bypass rate and Wilson upper bound of 0.22% after the four hardening steps cover only the tested categories. The manuscript provides no analysis or experiments addressing novel attacks, adaptive adversaries, dynamic content, or other bypass vectors that could produce certificates allowing unsafe actions to pass the deterministic gate. This assumption is load-bearing for the zero-unsafe-execution claims on the 200 end-to-end and 120 browser tasks.
- [Action-critical predicates description] Action-critical predicate decomposition: the paper treats the decomposition as both necessary and sufficient for capturing all safety-relevant preconditions, yet offers no formal coverage argument, completeness proof, or exhaustive enumeration showing that every possible unsafe precondition is represented by these predicates. If any precondition is omitted, accurate certificates could still authorize unsafe actions.
minor comments (2)
- [Abstract] The abstract references 'GPT-5.4 traces'; clarify whether this denotes a specific model release or is a descriptive placeholder for the replay experiment.
- [Results figures] Figure captions for the HACR audit and oracle replay results could more explicitly state the threat model and isolation guarantees to aid reader interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for major revision. The comments highlight important considerations regarding the scope of our security evaluation and the coverage of the predicate decomposition. We respond to each point below with clarifications and proposed revisions that strengthen the presentation of our results without overstating the claims.
- Referee: [Verifier red-teaming section] Verifier red-teaming across 17 canonical attack categories: the 0/1,700 bypass rate and Wilson upper bound of 0.22% after the four hardening steps cover only the tested categories. The manuscript provides no analysis or experiments addressing novel attacks, adaptive adversaries, dynamic content, or other bypass vectors that could produce certificates allowing unsafe actions to pass the deterministic gate. This assumption is load-bearing for the zero-unsafe-execution claims on the 200 end-to-end and 120 browser tasks.
Authors: We agree that the red-teaming results apply specifically to the 17 canonical attack categories we defined and tested. The 0/1,700 bypass rate and associated Wilson bound demonstrate effective mitigation for those vectors following the four hardening steps. We do not claim or imply resistance to arbitrary novel attacks, adaptive adversaries, or untested dynamic content, as such guarantees would require a different formal security model. The end-to-end zero-unsafe-execution results remain empirical observations under the evaluated threat model and task distribution, with the reported statistical bounds. We will add a dedicated limitations subsection that explicitly discusses these boundaries, notes the potential for adaptive bypasses, and suggests avenues for ongoing red-teaming. Revision: partial.
- Referee: [Action-critical predicates description] Action-critical predicate decomposition: the paper treats the decomposition as both necessary and sufficient for capturing all safety-relevant preconditions, yet offers no formal coverage argument, completeness proof, or exhaustive enumeration showing that every possible unsafe precondition is represented by these predicates. If any precondition is omitted, accurate certificates could still authorize unsafe actions.
Authors: The predicates are obtained by inspecting the safety-relevant preconditions of each tool API within the concrete task environments we study. We do not assert a formal completeness proof or exhaustive enumeration, as the space of possible unsafe preconditions is open-ended in general multimodal settings. Instead, the approach relies on making any coverage gaps visible through verifier residuals and schema constraints, which are then audited in the HACR evaluation. The empirical results show that the selected predicates block the unsafe executions that occur under naive and prompt-only baselines. We will revise the manuscript to state clearly that the predicates are task-derived rather than universally complete and to include a concrete example of predicate construction for one representative task. This is a full revision to the relevant description and discussion sections. Revision: yes.
Circularity Check
No significant circularity in the ECA derivation chain
The paper grounds its zero-unsafe-execution claims in two independent empirical mechanisms: red-teaming of the hardened DOM/OCR/AX verifiers against 17 canonical attack categories (yielding 0/1,700 bypasses) and oracle-certificate replay over 7,488 traces that isolates deterministic gate correctness from model outputs. These checks are external to the agent's perceptual claims and do not reduce the reported safety bounds to quantities defined by the authors' own fitted parameters or self-referential predicates. The action-critical predicate decomposition is presented as an explicit design choice whose coverage is validated by the HACR audit rather than assumed by construction. No load-bearing step equates a derived result to its input by definition, self-citation, or renaming.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: constrained DOM/OCR/AX verifiers can be hardened to resist all 17 canonical attack categories while still producing accurate certificates for legitimate content.
invented entities (1)
- Evidence-Carrying Multimodal Agent (ECA) with deterministic gate: no independent evidence.