pith. sign in

arxiv: 2605.15030 · v1 · pith:C6QXZPSOnew · submitted 2026-05-14 · 💻 cs.CR · cs.AI

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

Pith reviewed 2026-06-30 20:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords web agentsprompt injectionadversarial defenseguard modeladversarial trainingweb securityAI agentsprompt injection attacks
0
0 comments X

The pith

WARD is a guard model that defends web agents from prompt injections with near-perfect recall on unseen data, low false positives, and no added latency while resisting adaptive attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web agents that browse and act on websites are open to prompt injection attacks hidden in page content. The paper presents WARD as a dedicated guard built on a 177K-sample dataset drawn from 719 real high-traffic sites plus a second dataset focused on attacks against the guard itself. It adds an iterative training loop in which an attacker and the guard improve against each other. The resulting model is claimed to generalize to new domains, avoid blocking normal pages, stay effective when attackers target the guard directly, and run alongside the agent without slowing it down. If these properties hold, web agents could be used more safely in open environments.

Core claim

WARD is a guard model trained on WARD-Base (approximately 177K samples from 719 high-traffic URLs) and WARD-PIG (a prompt-injection dataset targeting the guard), using the A3T adaptive adversarial attack training framework that performs memory-based attacker-guard co-evolution; it achieves nearly perfect recall on out-of-distribution benchmarks, low false-positive rates, robustness to guard-targeted and adaptive attacks under distribution shifts, and parallel execution with the agent that adds no latency.

What carries the argument

A3T adaptive adversarial attack training framework that iteratively strengthens the guard through memory-based attacker and guard co-evolution.

If this is right

  • Web agents can complete tasks on live websites with reduced risk of being hijacked by injected instructions.
  • Agent utility stays intact because benign content is rarely misclassified as malicious.
  • The defense continues to work when attackers deliberately craft inputs to fool the guard model itself.
  • No extra waiting time is added at each agent step because the guard runs in parallel.
  • The same training approach could be reused whenever new attack patterns appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The co-evolution training loop could be applied to other language-model agents that face similar injection risks outside web browsing.
  • Periodic retraining on newly collected pages would likely be required as website structures and attack techniques change.
  • Production deployment on actual user-facing agents would provide a direct test of whether the reported efficiency and robustness translate to real traffic.
  • The approach might extend to visual prompt injections in multimodal agents if the guard is updated to process screenshots alongside text.

Load-bearing premise

The datasets collected from high-traffic sites and the dedicated prompt-injection set are assumed to cover enough real-world content and attack variations for the reported generalization and robustness to hold.

What would settle it

Running WARD on a fresh collection of web pages and prompt-injection examples gathered after the original datasets and measuring whether recall falls below 95 percent or false-positive rate rises above 5 percent on those new examples.

Figures

Figures reproduced from arXiv: 2605.15030 by Bryan Hooi, Hieu Cao, Khoi Le, Shuicheng Yan, Thong Nguyen, Tri Cao, Yibo Li, Yue Liu, Yuexin Li, Yufei He, Yulin Chen.

Figure 1
Figure 1. Figure 1: Comparison of WARD with the best performing guard model baselines across detection [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the WARD framework. For each URL, we generate 5 benign user tasks I (4,040 total) and execute them using a Browser-Use agent [13] for up to 100 steps or until completion. At each step, we record the HTML H0 and screenshot S0, forming source tuples (H0, S0, I). Web Development. We construct simulated platforms for the native branch to replicate high-risk applications where prompt injection appea… view at source ↗
Figure 3
Figure 3. Figure 3: WARD-Base Statistics and Composition across Injection Locations, Attack Goal Types, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (Robustness-A3T) Guard performance under A3T with different trial budgets. The first two [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Examples of popup-based prompt injection across datasets. Left: a popup example from [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Representative failure case on Kleinanzeigen. The attack inserts a plausible “Smart Search” recommendation box that supplies a fabricated list of ten Berlin bike listings under 200 EUR. Because the injected content appears helpful, interface-consistent, and strongly aligned with the user’s task, the guard predicts benign [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Injection channel distribution in WARD-Base. The left chart shows overlay-branch channels, [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
read the original abstract

Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD-Base, a large-scale dataset with around 177K samples collected from 719 high-traffic URLs and platforms, and WARD-PIG, a dedicated dataset designed for prompt injection attacks targeting the guard model. We further introduce A3T, an adaptive adversarial attack training framework that iteratively strengthens WARD through a memory-based attacker and guard co-evolution process. Extensive experiments show that WARD achieves nearly perfect recall on out-of-distribution benchmarks, maintains low false positive rates to preserve agent utility, remains robust against guard-targeted and adaptive attacks under substantial distribution shifts, and runs efficiently in parallel with the agent without introducing additional latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes WARD, a guard model for defending web agents against prompt injection attacks in HTML and visual interfaces. It is constructed from the WARD-Base dataset (~177K samples collected from 719 high-traffic URLs and platforms) and the WARD-PIG dataset for guard-targeted attacks, trained via the A3T framework that performs iterative co-evolution between a memory-based attacker and the guard. The central experimental claims are nearly perfect recall on out-of-distribution benchmarks, low false-positive rates that preserve agent utility, robustness to both guard-targeted and adaptive attacks under distribution shifts, and efficient parallel execution with no added latency.

Significance. If the reported results hold under rigorous verification, this would be a meaningful practical contribution to securing autonomous web agents, directly addressing the four stated limitations of prior guards (generalization, false positives, latency, and adversarial vulnerability). The scale of the collected data and the explicit co-evolutionary training loop are strengths that could influence future defense design.

major comments (2)
  1. [§3] §3 (Dataset construction): The OOD generalization and robustness claims rest on the assumption that the 719 high-traffic URLs in WARD-Base provide sufficient domain, structural, and linguistic diversity to support performance under 'substantial distribution shifts.' No quantitative analysis (e.g., domain entropy, language distribution, or structural feature coverage) is supplied to substantiate this coverage, leaving open the possibility that reported OOD results reflect in-distribution behavior rather than true generalization.
  2. [§4.3, §5] §4.3 and §5 (Evaluation and A3T): The robustness claims against adaptive and guard-targeted attacks are supported by A3T co-evolution, yet the paper does not report whether the held-out OOD benchmarks and WARD-PIG test splits were generated independently of the co-evolution loop or whether any post-hoc selection of attack variants occurred. This detail is load-bearing for the claim that robustness extends 'outside the training loop.'
minor comments (2)
  1. [Abstract] The abstract states 'nearly perfect recall' and 'low false positive rates' without numerical values or confidence intervals; these should be stated explicitly with the corresponding tables or figures in the main text.
  2. [§4.2] Notation for the memory-based attacker components in the A3T description could be clarified with a small diagram or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset construction): The OOD generalization and robustness claims rest on the assumption that the 719 high-traffic URLs in WARD-Base provide sufficient domain, structural, and linguistic diversity to support performance under 'substantial distribution shifts.' No quantitative analysis (e.g., domain entropy, language distribution, or structural feature coverage) is supplied to substantiate this coverage, leaving open the possibility that reported OOD results reflect in-distribution behavior rather than true generalization.

    Authors: We agree that quantitative diversity metrics would strengthen the justification for OOD claims. The 719 URLs were selected from high-traffic sources across multiple platforms to capture varied web structures, but explicit entropy or coverage statistics were omitted. In revision we will add domain, language, and structural feature analysis to §3. revision: yes

  2. Referee: [§4.3, §5] §4.3 and §5 (Evaluation and A3T): The robustness claims against adaptive and guard-targeted attacks are supported by A3T co-evolution, yet the paper does not report whether the held-out OOD benchmarks and WARD-PIG test splits were generated independently of the co-evolution loop or whether any post-hoc selection of attack variants occurred. This detail is load-bearing for the claim that robustness extends 'outside the training loop.'

    Authors: The OOD benchmarks and WARD-PIG test splits were constructed and held out independently of the A3T training loop; the co-evolution operated exclusively on training data with no post-hoc variant selection on test sets. We will add explicit statements confirming this separation in the revised §4.3 and §5. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on experimental evaluation of independently collected datasets

full rationale

The paper describes an empirical guard model (WARD) trained on two custom datasets (WARD-Base from 719 URLs, WARD-PIG) plus an adaptive co-evolution training loop (A3T). All central claims—recall, false-positive rates, robustness to adaptive/guard-targeted attacks—are presented as outcomes of experimental benchmarks rather than any derivation, equation, or fitted quantity defined in terms of itself. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. Dataset coverage is an unverified modeling assumption but does not reduce any claimed result to its inputs by construction; the evaluation remains externally falsifiable on held-out benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on model parameters, training hyperparameters, or architectural assumptions; ledger left empty due to lack of information.

pith-pipeline@v0.9.1-grok · 5781 in / 1003 out tokens · 20866 ms · 2026-06-30T20:14:05.256277+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Same-Origin Policy for Agentic Browsers

    cs.CR 2026-06 unverdicted novelty 7.0

    The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.

Reference graph

Works this paper leans on

113 extracted references · 46 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Attacking multimodal os agents with malicious image patches.arXiv preprint arXiv:2503.10809, 2025

    Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, and Adel Bibi. Attacking multimodal os agents with malicious image patches.arXiv preprint arXiv:2503.10809, 2025

  2. [2]

    Computer use

    Anthropic. Computer use. https://docs.claude.com/en/docs/agents-and-tools/ tool-use/computer-use-tool, 2025. Accessed: 2025-09-24

  3. [3]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

  4. [4]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

  5. [5]

    Phishagent: a robust multimodal agent for phishing webpage detection

    Tri Cao, Chengyu Huang, Yuexin Li, Wang Huilin, Amy He, Nay Oo, and Bryan Hooi. Phishagent: a robust multimodal agent for phishing webpage detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27869–27877, 2025

  6. [6]

    Phishagent: A robust multimodal agent for phishing webpage detection.arXiv preprint arXiv:2408.10738, 2024

    Tri Cao, Chengyu Huang, Yuexin Li, Huilin Wang, Amy He, Nay Oo, and Bryan Hooi. Phishagent: A robust multimodal agent for phishing webpage detection.arXiv preprint arXiv:2408.10738, 2024

  7. [7]

    Vpi-bench: Visual prompt injection attacks for computer-use agents

    Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, and Bryan Hooi. Vpi-bench: Visual prompt injection attacks for computer-use agents. arXiv preprint arXiv:2506.02456, 2025

  8. [8]

    WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

    Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He, Le Minh Khoi, Yangqiu Song, Shuicheng Yan, and Bryan Hooi. Webagentguard: A reasoning-driven guard model for detecting prompt injection attacks in web agents.arXiv preprint arXiv:2604.12284, 2026

  9. [9]

    Topicattack: An indirect prompt injection attack via topic transition

    Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, and Bryan Hooi. Topicattack: An indirect prompt injection attack via topic transition. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7338–7356, 2025

  10. [10]

    Backdoor-powered prompt injection attacks nullify defense methods

    Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, and Bryan Hooi. Backdoor-powered prompt injection attacks nullify defense methods. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 4508–4527, 2025

  11. [11]

    Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024

    Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. Llama guard 3 vision: Safeguarding human-ai image understanding conversations.arXiv preprint arXiv:2411.10414, 2024

  12. [12]

    Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

  13. [13]

    Browser-use agent documentation

    Browser-Use Contributors. Browser-use agent documentation. https://docs.browser-use. com/introduction, 2025. Accessed: 2025-05-15

  14. [14]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021. 10

  15. [15]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

  16. [16]

    Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

    Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

  17. [17]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 28091–28114. Curran Associates, Inc., 2023

  18. [18]

    Eta: Evaluating then aligning safety of vision language models at inference time.arXiv preprint arXiv:2410.06625, 2024

    Yi Ding, Bolian Li, and Ruqi Zhang. Eta: Evaluating then aligning safety of vision language models at inference time.arXiv preprint arXiv:2410.06625, 2024

  19. [19]

    SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

    Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin, and Ee-Chien Chang. Snapguard: Lightweight prompt injection detection for screenshot-based web agents.arXiv preprint arXiv:2604.25562, 2026

  20. [20]

    WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

    Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaud- huri. Wasp: Benchmarking web agent security against prompt injection attacks.arXiv preprint arXiv:2504.18575, 2025

  21. [21]

    Imprompter: Tricking llm agents into improper tool use.arXiv preprint arXiv:2410.14923, 2024

    Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking llm agents into improper tool use.arXiv preprint arXiv:2410.14923, 2024

  22. [22]

    PIArena: A Platform for Prompt Injection Evaluation

    Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, and Jinyuan Jia. Piarena: A platform for prompt injection evaluation.arXiv preprint arXiv:2604.08499, 2026

  23. [23]

    A large- scale multi-document summarization dataset from the wikipedia current events portal

    Demian Gholipour Ghalandari, Chris Hokamp, John Glover, Georgiana Ifrim, et al. A large- scale multi-document summarization dataset from the wikipedia current events portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1302–1308, 2020

  24. [24]

    Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment

    Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, and Amrit Singh Bedi. Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment. arXiv preprint arXiv:2411.18688, 2024

  25. [25]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models.arXiv preprint arXiv:2302.12173, 27, 2023

  26. [26]

    Longcoder: A long-range pre-trained language model for code completion

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. InInternational Conference on Machine Learning, pages 12098–12107. PMLR, 2023

  27. [27]

    Making every step effective: Jailbreaking large vision-language models through hierarchical kv equalization.arXiv preprint arXiv:2503.11750, 2025

    Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, and Yujun Cai. Making every step effective: Jailbreaking large vision-language models through hierarchical kv equalization.arXiv preprint arXiv:2503.11750, 2025

  28. [28]

    Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models.arXiv preprint arXiv:2503.17682, 2025

    Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models.arXiv preprint arXiv:2503.17682, 2025

  29. [29]

    Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

    Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024. 11

  30. [30]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

  31. [31]

    An empirical survey on long document summarization: Datasets, models, and metrics.ACM computing surveys, 55(8):1–35, 2022

    Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. An empirical survey on long document summarization: Datasets, models, and metrics.ACM computing surveys, 55(8):1–35, 2022

  32. [32]

    Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

  33. [33]

    {KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection

    Yuexin Li, Chengyu Huang, Shumin Deng, Mei Lin Lock, Tri Cao, Nay Oo, Hoon Wei Lim, and Bryan Hooi. {KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection. In33rd USENIX Security Symposium (USENIX Security 24), pages 793–810, 2024

  34. [34]

    Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

    Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

  35. [35]

    Eia: Environmental injection attack on generalist web agents for privacy leakage

    Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage. InThe Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755, 2026

    Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, and Enhong Chen. Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755, 2026

  37. [37]

    Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap.arXiv preprint arXiv:2502.10486, 2025

    Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap.arXiv preprint arXiv:2502.10486, 2025

  38. [38]

    FlipAttack: Jailbreak LLMs via Flipping

    Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via flipping.arXiv preprint arXiv:2410.02832, 2024

  39. [39]

    Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.arXiv preprint arXiv:2505.11049, 2025

    Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, et al. Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.arXiv preprint arXiv:2505.11049, 2025

  40. [40]

    Datasentinel: A game-theoretic detection of prompt injection attacks

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In2025 IEEE Symposium on Security and Privacy (SP), pages 2190–2208. IEEE, 2025

  41. [41]

    Safety alignment for vision language models.arXiv preprint arXiv:2405.13581, 2024

    Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xi- aoyong Zhu, and Bo Zheng. Safety alignment for vision language models.arXiv preprint arXiv:2405.13581, 2024

  42. [42]

    Model card - prompt guard

    Meta. Model card - prompt guard. https://huggingface.co/meta-llama/ Prompt-Guard-86M, 2024

  43. [43]

    Llama guard 4 model card

    Meta. Llama guard 4 model card. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama-guard-4/, 2025. Accessed: 2026-05-06

  44. [44]

    Llama prompt guard 2 model card

    Meta. Llama prompt guard 2 model card. https://huggingface.co/meta-llama/ Llama-Prompt-Guard-2-86M/, 2025

  45. [45]

    The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

    Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, et al. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections. arXiv preprint arXiv:2510.09023, 2025. 12

  46. [46]

    A survey of webagents: Towards next-generation ai agents for web automation with large foundation models

    Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S Yu, et al. A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6140–6150, 2025

  47. [47]

    Introducing gpt-oss-safeguard

    OpenAI. Introducing gpt-oss-safeguard. https://openai.com/index/ introducing-gpt-oss-safeguard/, 2025. Accessed: 2026-05-06

  48. [48]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  49. [49]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

  50. [50]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  51. [51]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  52. [52]

    Know what you don’t know: Unanswerable ques- tions for squad

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable ques- tions for squad. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

  53. [53]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  54. [54]

    Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

    Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. InThe Twelfth International Conference on Learning Representations, 2023

  55. [55]

    Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

    Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

  56. [56]

    Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku

    Claude Team. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https://www.anthropic.com/news/3-5-models-and-computer-use, 2024

  57. [57]

    Openagentsafety: Aframeworkforevaluatingreal-worldaiagentsafety.arXiv preprint arXiv:2507.06134, 2026

    Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Gra- ham Neubig, and Maarten Sap. Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134, 2025

  58. [58]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  59. [59]

    Manipulating multimodal agents via cross-modal prompt injection

    Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Manipulating multimodal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10955–10964, 2025

  60. [60]

    Inferaligner: Inference-time alignment for harmlessness through cross-model guidance.arXiv preprint arXiv:2401.11206, 2024

    Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance.arXiv preprint arXiv:2401.11206, 2024

  61. [61]

    Ideator: Jailbreaking and benchmarking large vision-language models using themselves

    Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, and Yu-Gang Jiang. Ideator: Jailbreaking and benchmarking large vision-language models using themselves. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8875–8884, 2025. 13

  62. [62]

    Webinject: Prompt injection attack to web agents

    Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, and Neil Zhenqiang Gong. Webinject: Prompt injection attack to web agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2010–2030, 2025

  63. [63]

    Websentinel: Detecting and localizing prompt injection attacks for web agents.arXiv preprint arXiv:2602.03792, 2026

    Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, and Neil Gong. Websentinel: Detecting and localizing prompt injection attacks for web agents.arXiv preprint arXiv:2602.03792, 2026

  64. [64]

    Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024

  65. [65]

    Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning

    Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, et al. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. InICML 2025 Workshop on Computer Use Agents, 2025

  66. [66]

    Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training.arXiv preprint arXiv:2502.11455, 2025

    Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, and Wenjie Wang. Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training.arXiv preprint arXiv:2502.11455, 2025

  67. [67]

    Delimiters won’t save you from prompt injection

    Simon Willison. Delimiters won’t save you from prompt injection. https://simonwillison. net/2023/May/11/delimiters-wont-save-you, 2023

  68. [68]

    Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814,

    Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Adversarial attacks on multimodal agents.arXiv preprint arXiv:2406.12814, 2024

  69. [69]

    Wipi: A new web threat for llm-driven web agents.arXiv preprint arXiv:2402.16965, 2024

    Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. Wipi: A new web threat for llm-driven web agents.arXiv preprint arXiv:2402.16965, 2024

  70. [70]

    Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

  71. [71]

    GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

    Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.arXiv preprint arXiv:2406.09187, 2024

  72. [72]

    Advweb: Controllable black-box attacks on vlm-powered web agents.arXiv preprint arXiv:2410.17401, 2024

    Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advweb: Controllable black-box attacks on vlm-powered web agents.arXiv preprint arXiv:2410.17401, 2024

  73. [73]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

  74. [74]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

  75. [75]

    Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  76. [76]

    A survey of safety on large vision-language models: Attacks, defenses and evaluations.arXiv preprint arXiv:2502.14881, 2025

    Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. A survey of safety on large vision-language models: Attacks, defenses and evaluations.arXiv preprint arXiv:2502.14881, 2025

  77. [77]

    Superagent guard: Frontier guardrails for ai agents

    Alan Zabihi. Superagent guard: Frontier guardrails for ai agents. https://www.superagent. sh/blog/superagent-guard-frontier-guardrails-for-ai-agents, 2026

  78. [78]

    Browsesafe: Understanding and preventing prompt injection within ai browser agents.arXiv preprint arXiv:2511.20597, 2025

    Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, and Ninghui Li. Browsesafe: Understanding and preventing prompt injection within ai browser agents.arXiv preprint arXiv:2511.20597, 2025. 14

  79. [79]

    Mir- rorguard: Toward secure computer-use agents via simulation-to-real reasoning correction.arXiv preprint arXiv:2601.12822, 2026

    Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, and Xudong Pan. Mir- rorguard: Toward secure computer-use agents via simulation-to-real reasoning correction.arXiv preprint arXiv:2601.12822, 2026

  80. [80]

    Attacking vision-language computer agents via pop-ups

    Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. arXiv preprint arXiv:2411.02391, 2024

Showing first 80 references.