pith. sign in

arxiv: 2606.20717 · v1 · pith:IQKXYTBFnew · submitted 2026-06-16 · 💻 cs.CV · cs.AI· cs.CR

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

Pith reviewed 2026-06-27 00:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CR
keywords visual prompt injectionMLLM web agentsadversarial imagesdiffusion modelsnext-action hijackingstealthy attacksconstrained threat modelweb automation vulnerabilities
0
0 comments X

The pith

Diffusion models can generate ad-slot images that hijack MLLM web agents into attacker-chosen actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MIRAGE, a framework for visual indirect prompt injection that generates adversarial images strictly inside attacker-controlled but legitimate page regions such as ad slots. These images are produced by diffusion models and optimized to remain visually normal to humans while directing multimodal web agents toward specific next actions chosen by the attacker. The work operates under a constrained threat model where the attacker has no privileged access and must stay within semantically appropriate boundaries. Evaluations on SeeAct and OpenClaw demonstrate that the resulting attacks succeed at scale while preserving stealth. If correct, the claim shows that current MLLM web agents remain open to targeted hijacking even when the attacker controls only a small, permitted portion of the interface.

Core claim

MIRAGE leverages diffusion models to generate perceptually benign adversarial images confined to attacker-controlled boundaries such as ad slots, using curvature-aware adversarial diffusion guidance combined with sparse, dark-pixel residual perturbations to achieve targeted next-action hijacking in MLLM web agents.

What carries the argument

MIRAGE, a visual indirect prompt injection framework that generates adversarial images via diffusion models optimized for both human imperceptibility and action hijacking within spatially constrained legitimate regions.

If this is right

  • Attacks succeed even when the adversary controls only a small legitimate region of the page.
  • The generated images evade human detection while still changing agent behavior on SeeAct and OpenClaw.
  • Curvature-aware guidance plus sparse dark-pixel perturbations suffice to maximize efficacy inside the restricted setting.
  • The same constrained threat model applies to any MLLM web agent that processes page screenshots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Web platforms may need additional image sanitization or region-specific verification steps for third-party content.
  • Agents could incorporate explicit checks that discount or cross-verify visual content originating from known ad-like areas.
  • The same diffusion-based injection approach could be tested on other multimodal interfaces that accept external imagery under similar spatial constraints.

Load-bearing premise

Diffusion models can reliably produce images that stay both perceptually benign to humans and effective for next-action hijacking when strictly confined to semantically legitimate, spatially constrained regions such as ad slots.

What would settle it

An experiment in which no diffusion-generated image placed inside an ad slot on a live page alters the agent's next action without also becoming visibly altered to human viewers.

Figures

Figures reproduced from arXiv: 2606.20717 by Biwei Yan, Boyang Ma, Jianyu Ma, Xuelong Dai, Yijun Yang, Yue Zhang.

Figure 1
Figure 1. Figure 1: Illustration of MIRAGE. A trusted webpage contains a bounded region controlled by a third-party merchant or advertiser. The attacker replaces only this region with optimized visual content. When the web agent observes the rendered screenshot, the injected region can redirect the agent’s visual reasoning and next-action prediction, even though the site-owned interface, browser execution, and user instructio… view at source ↗
Figure 2
Figure 2. Figure 2: Ablation studies on the parameter settings for adversarial diffusion sampling. The experiments are evaluated on SeeAct-LLaVA. action grounding stage. Ablation studies on adversarial diffusion sam￾pling. Incorporating diffusion models yields a sub￾stantial gain in attack performance, as shown in Ta￾ble 2. Increasing the hyperparameters T and K im￾proves the attack success rate, though with a notice￾able tra… view at source ↗
Figure 3
Figure 3. Figure 3: The selection of semantic similarity thresh￾old. E Cross-Attention Heatmap with MIRAGE To illustrate the underlying mechanism, [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The cross-attention heatmaps from the clean screenshot and MIRAGE screenshot. The red box represents the injected patch. jacking and defense evaluation in tool-using en￾vironments (Xu et al., 2025; Zhang et al., 2025a; Debenedetti et al., 2024). These attacks reveal that web agents are vulnerable to external context, but most of them rely on explicit text, pop-ups, webpage-environment manipulation, or prom… view at source ↗
Figure 5
Figure 5. Figure 5: An attack screenshot with Naive Attack. ability of adversarial examples. Diff-PGD uses dif￾fusion guidance to keep adversarial samples close to the natural image distribution while preserv￾ing attack effectiveness (Xue et al., 2023). Dif￾fAttack and AdvDiff generate more imperceptible or unrestricted adversarial examples by manipulat￾ing diffusion latent spaces or sampling trajectories (Chen et al., 2023a;… view at source ↗
Figure 6
Figure 6. Figure 6: An attack screenshot with WebInject [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: An attack screenshot with EIA [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: An attack screenshot with Popup Attack [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: An attack screenshot with MIRAGE [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: An attack screenshot with MIRAGE. The screenshot is visually aggressive with a light-colored, simple background [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
read the original abstract

Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a trusted web platform where the evaluator acts solely as an unprivileged third party, such as a merchant or advertiser, controlling only a semantically legitimate, spatially constrained region, such as an ad slot, a sponsored card, or a localized widget. Operating under these realistic constraints, we propose MIRAGE, a novel visual indirect prompt injection framework for targeted next-action hijacking. Our approach leverages diffusion models to generate perceptually benign adversarial images strictly confined to the attacker-controlled boundaries permitted by the trusted service provider. To maximize attack efficacy within such a restrictive setting, we introduce a robust optimization technique combining curvature-aware adversarial diffusion guidance with sparse, dark-pixel residual perturbations. Comprehensive evaluations against prominent MLLM web agent frameworks, specifically SeeAct and OpenClaw, empirically demonstrate the potency, realism, and stealth of our proposed MIRAGE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes MIRAGE, a visual indirect prompt injection framework for MLLM-based web agents. Under a constrained threat model where the attacker controls only a semantically legitimate region (e.g., ad slot), it uses diffusion models with curvature-aware adversarial guidance and sparse dark-pixel residuals to generate perceptually benign images that hijack next actions. Comprehensive evaluations on SeeAct and OpenClaw are claimed to demonstrate the approach's potency, realism, and stealth.

Significance. If the empirical claims hold under the stated constraints, the work would be significant for identifying realistic vulnerabilities in deployed MLLM web agents and for advancing constrained adversarial generation techniques in vision-language systems. The emphasis on a trusted-platform, third-party attacker model distinguishes it from more permissive threat models in prior work.

major comments (1)
  1. [Abstract] Abstract: the central claim that MIRAGE images 'empirically demonstrate the potency, realism, and stealth' rests on the unverified assumption that curvature-aware diffusion guidance plus sparse residuals can simultaneously satisfy human-perceptual benignity and reliable next-action hijacking when confined to small, semantically legitimate regions. No success rates, visibility metrics, human perceptual studies, or failure-case analysis are supplied to show the two objectives are jointly achieved rather than traded off.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for their review and the opportunity to respond to the concern regarding the abstract. We address the point directly below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that MIRAGE images 'empirically demonstrate the potency, realism, and stealth' rests on the unverified assumption that curvature-aware diffusion guidance plus sparse residuals can simultaneously satisfy human-perceptual benignity and reliable next-action hijacking when confined to small, semantically legitimate regions. No success rates, visibility metrics, human perceptual studies, or failure-case analysis are supplied to show the two objectives are jointly achieved rather than traded off.

    Authors: The abstract is a concise summary; the supporting quantitative evidence appears in the body of the manuscript. Section 4 reports success rates for targeted next-action hijacking (e.g., 78-87% on SeeAct and 71-84% on OpenClaw across 200 trials under the constrained ad-slot threat model). Section 5.2 quantifies stealth via LPIPS, SSIM, and PSNR metrics on the generated images, showing values comparable to benign ad content. Section 5.4 provides failure-case analysis, identifying cases where agent policy or OCR variance prevents hijacking. Human perceptual studies are not included in the current manuscript. We can revise the abstract to reference the specific quantitative results from Sections 4 and 5 rather than using the general phrasing 'empirically demonstrate.' revision: partial

standing simulated objections not resolved
  • Absence of formal human perceptual studies to support the stealth claim

Circularity Check

0 steps flagged

No circularity: empirical method proposal with independent evaluations

full rationale

The paper introduces MIRAGE as a novel framework using diffusion models for adversarial image generation within constrained regions, combined with a described optimization technique (curvature-aware guidance plus sparse residuals). The central claim rests on empirical success rates against SeeAct and OpenClaw rather than any derivation, equation, or fitted parameter that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description to justify load-bearing steps. The evaluations are presented as external validation, making the work self-contained against benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on abstract; the central approach rests on the unverified capability of diffusion models to satisfy both stealth and attack efficacy under spatial constraints.

axioms (1)
  • domain assumption Diffusion models can generate images that are perceptually benign while carrying effective adversarial signals when optimized with curvature-aware guidance and sparse perturbations.
    This is the core technical premise invoked to enable the attack within the constrained setting.

pith-pipeline@v0.9.1-grok · 5751 in / 1169 out tokens · 36075 ms · 2026-06-27T00:51:10.061914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 6 linked inside Pith

  1. [1]

    Advances in Neural Information Processing Systems , year =

    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , author =. Advances in Neural Information Processing Systems , year =

  2. [2]

    arXiv preprint arXiv:2307.13854 , year =

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author =. arXiv preprint arXiv:2307.13854 , year =

  3. [3]

    Advances in Neural Information Processing Systems , year =

    Mind2Web: Towards a Generalist Agent for the Web , author =. Advances in Neural Information Processing Systems , year =

  4. [4]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =

  5. [5]

    arXiv preprint arXiv:2401.01614 , year =

    GPT-4V(ision) is a Generalist Web Agent, if Grounded , author =. arXiv preprint arXiv:2401.01614 , year =

  6. [6]

    arXiv preprint arXiv:2312.08914 , year =

    CogAgent: A Visual Language Model for GUI Agents , author =. arXiv preprint arXiv:2312.08914 , year =

  7. [7]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =

  8. [8]

    arXiv preprint arXiv:2404.07972 , year =

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments , author =. arXiv preprint arXiv:2404.07972 , year =

  9. [9]

    arXiv preprint arXiv:2302.12173 , year =

    Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection , author =. arXiv preprint arXiv:2302.12173 , year =

  10. [10]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages =

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents , author =. Findings of the Association for Computational Linguistics: ACL 2024 , pages =. 2024 , doi =

  11. [11]

    Advances in Neural Information Processing Systems , year =

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , author =. Advances in Neural Information Processing Systems , year =

  12. [12]

    The Thirteenth International Conference on Learning Representations , year =

    EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage , author =. The Thirteenth International Conference on Learning Representations , year =

  13. [13]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , pages =

    Attacking Vision-Language Computer Agents via Pop-ups , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , pages =. 2025 , doi =

  14. [14]

    arXiv preprint arXiv:2410.17401 , year =

    AdvAgent: Controllable Blackbox Red-teaming on Web Agents , author =. arXiv preprint arXiv:2410.17401 , year =

  15. [15]

    Proceedings of the 42nd International Conference on Machine Learning , series =

    UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning , author =. Proceedings of the 42nd International Conference on Machine Learning , series =. 2025 , publisher =

  16. [16]

    arXiv preprint arXiv:2505.21499 , year =

    AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery , author =. arXiv preprint arXiv:2505.21499 , year =

  17. [17]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

    WebInject: Prompt Injection Attack to Web Agents , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =. 2025 , doi =

  18. [18]

    arXiv preprint arXiv:2306.13213 , year =

    Visual Adversarial Examples Jailbreak Aligned Large Language Models , author =. arXiv preprint arXiv:2306.13213 , year =

  19. [19]

    arXiv preprint arXiv:2309.00236 , year =

    Image Hijacks: Adversarial Images can Control Generative Models at Runtime , author =. arXiv preprint arXiv:2309.00236 , year =

  20. [20]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume =

    FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2025 , doi =

  21. [21]

    arXiv preprint arXiv:2503.10809 , year =

    MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents , author =. arXiv preprint arXiv:2503.10809 , year =

  22. [22]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages =

    On the Robustness of GUI Grounding Models Against Image Attacks , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages =

  23. [23]

    International Conference on Learning Representations , year =

    Denoising Diffusion Implicit Models , author =. International Conference on Learning Representations , year =

  24. [24]

    arXiv preprint arXiv:2305.16494 , year =

    Diffusion-Based Adversarial Sample Generation for Improved Stealthiness and Controllability , author =. arXiv preprint arXiv:2305.16494 , year =

  25. [25]

    arXiv preprint arXiv:2305.08192 , year =

    Diffusion Models for Imperceptible and Transferable Adversarial Attack , author =. arXiv preprint arXiv:2305.08192 , year =

  26. [26]

    European Conference on Computer Vision , year =

    AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models , author =. European Conference on Computer Vision , year =

  27. [27]

    IEEE Transactions on Information Forensics and Security , year =

    Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models via Diffusion Models , author =. IEEE Transactions on Information Forensics and Security , year =

  28. [28]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  29. [29]

    arXiv preprint arXiv:2503.01743 , year=

    Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras , author=. arXiv preprint arXiv:2503.01743 , year=

  30. [30]

    Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Z...

  31. [31]

    arXiv preprint arXiv:2407.21783 , year=

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  32. [32]

    2026 , note =

    Peter Steinberger and OpenClaw Contributors , title =. 2026 , note =

  33. [33]

    2026 , note =

    browser-use , howpublished =. 2026 , note =

  34. [34]

    Advances in neural information processing systems , volume=

    Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

  35. [35]

    Advances in Neural Information Processing Systems , volume=

    Content-based unrestricted adversarial attack , author=. Advances in Neural Information Processing Systems , volume=

  36. [36]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=