WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

Bryan Hooi; Hieu Cao; Khoi Le; Shuicheng Yan; Thong Nguyen; Tri Cao; Yibo Li; Yue Liu; Yuexin Li; Yufei He

arxiv: 2605.15030 · v1 · pith:C6QXZPSOnew · submitted 2026-05-14 · 💻 cs.CR · cs.AI

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

Tri Cao , Yulin Chen , Hieu Cao , Yibo Li , Khoi Le , Thong Nguyen , Yuexin Li , Yufei He

show 3 more authors

Yue Liu Shuicheng Yan Bryan Hooi

This is my paper

Pith reviewed 2026-06-30 20:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords web agentsprompt injectionadversarial defenseguard modeladversarial trainingweb securityAI agentsprompt injection attacks

0 comments

The pith

WARD is a guard model that defends web agents from prompt injections with near-perfect recall on unseen data, low false positives, and no added latency while resisting adaptive attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web agents that browse and act on websites are open to prompt injection attacks hidden in page content. The paper presents WARD as a dedicated guard built on a 177K-sample dataset drawn from 719 real high-traffic sites plus a second dataset focused on attacks against the guard itself. It adds an iterative training loop in which an attacker and the guard improve against each other. The resulting model is claimed to generalize to new domains, avoid blocking normal pages, stay effective when attackers target the guard directly, and run alongside the agent without slowing it down. If these properties hold, web agents could be used more safely in open environments.

Core claim

WARD is a guard model trained on WARD-Base (approximately 177K samples from 719 high-traffic URLs) and WARD-PIG (a prompt-injection dataset targeting the guard), using the A3T adaptive adversarial attack training framework that performs memory-based attacker-guard co-evolution; it achieves nearly perfect recall on out-of-distribution benchmarks, low false-positive rates, robustness to guard-targeted and adaptive attacks under distribution shifts, and parallel execution with the agent that adds no latency.

What carries the argument

A3T adaptive adversarial attack training framework that iteratively strengthens the guard through memory-based attacker and guard co-evolution.

If this is right

Web agents can complete tasks on live websites with reduced risk of being hijacked by injected instructions.
Agent utility stays intact because benign content is rarely misclassified as malicious.
The defense continues to work when attackers deliberately craft inputs to fool the guard model itself.
No extra waiting time is added at each agent step because the guard runs in parallel.
The same training approach could be reused whenever new attack patterns appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The co-evolution training loop could be applied to other language-model agents that face similar injection risks outside web browsing.
Periodic retraining on newly collected pages would likely be required as website structures and attack techniques change.
Production deployment on actual user-facing agents would provide a direct test of whether the reported efficiency and robustness translate to real traffic.
The approach might extend to visual prompt injections in multimodal agents if the guard is updated to process screenshots alongside text.

Load-bearing premise

The datasets collected from high-traffic sites and the dedicated prompt-injection set are assumed to cover enough real-world content and attack variations for the reported generalization and robustness to hold.

What would settle it

Running WARD on a fresh collection of web pages and prompt-injection examples gathered after the original datasets and measuring whether recall falls below 95 percent or false-positive rate rises above 5 percent on those new examples.

Figures

Figures reproduced from arXiv: 2605.15030 by Bryan Hooi, Hieu Cao, Khoi Le, Shuicheng Yan, Thong Nguyen, Tri Cao, Yibo Li, Yue Liu, Yuexin Li, Yufei He, Yulin Chen.

**Figure 2.** Figure 2: Overview of the WARD framework. For each URL, we generate 5 benign user tasks I (4,040 total) and execute them using a Browser-Use agent [13] for up to 100 steps or until completion. At each step, we record the HTML H0 and screenshot S0, forming source tuples (H0, S0, I). Web Development. We construct simulated platforms for the native branch to replicate high-risk applications where prompt injection appea… view at source ↗

**Figure 3.** Figure 3: WARD-Base Statistics and Composition across Injection Locations, Attack Goal Types, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: (Robustness-A3T) Guard performance under A3T with different trial budgets. The first two [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of popup-based prompt injection across datasets. Left: a popup example from [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Representative failure case on Kleinanzeigen. The attack inserts a plausible “Smart Search” recommendation box that supplies a fabricated list of ten Berlin bike listings under 200 EUR. Because the injected content appears helpful, interface-consistent, and strongly aligned with the user’s task, the guard predicts benign [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Injection channel distribution in WARD-Base. The left chart shows overlay-branch channels, [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

read the original abstract

Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD-Base, a large-scale dataset with around 177K samples collected from 719 high-traffic URLs and platforms, and WARD-PIG, a dedicated dataset designed for prompt injection attacks targeting the guard model. We further introduce A3T, an adaptive adversarial attack training framework that iteratively strengthens WARD through a memory-based attacker and guard co-evolution process. Extensive experiments show that WARD achieves nearly perfect recall on out-of-distribution benchmarks, maintains low false positive rates to preserve agent utility, remains robust against guard-targeted and adaptive attacks under substantial distribution shifts, and runs efficiently in parallel with the agent without introducing additional latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New datasets and A3T co-evolution loop for web agent prompt injection defense, but no experimental details to check the robustness claims.

read the letter

The paper's main point is a guard model called WARD that uses a new 177K-sample dataset from 719 high-traffic URLs plus a dedicated attack set, trained via an iterative memory-based attacker-guard loop called A3T. The abstract claims near-perfect OOD recall, low false positives, robustness to adaptive and guard-targeted attacks, and no added latency.

What is actually new is the scale and source of WARD-Base, the WARD-PIG attack collection, and the co-evolution setup tailored to web agents. The focus on running in parallel with the agent and handling evolving threats is a direct response to known weaknesses in existing guards.

The approach is reasonable on paper for a practical security tool. Collecting from real high-traffic sites is better than pure synthetic data, and the iterative training tries to close the loop on adaptive attacks.

The soft spots are exactly where the stress-test note flags them. The abstract gives zero information on data splits, baselines, how OOD is defined, or how attacks outside the training loop were constructed. Without those, the generalization and robustness numbers cannot be evaluated. The 719 URLs may be too narrow in domain or structure to support the claimed coverage of real-world shifts. That is a load-bearing gap, not a minor one.

This is for researchers working on securing autonomous web agents. Someone building or evaluating defenses in that area could use the datasets if released.

It deserves peer review because the problem is timely and the method is concrete enough to test once the experiments are shown. Send it out so the claims can be checked against the actual results and setup.

Referee Report

2 major / 2 minor

Summary. The paper proposes WARD, a guard model for defending web agents against prompt injection attacks in HTML and visual interfaces. It is constructed from the WARD-Base dataset (~177K samples collected from 719 high-traffic URLs and platforms) and the WARD-PIG dataset for guard-targeted attacks, trained via the A3T framework that performs iterative co-evolution between a memory-based attacker and the guard. The central experimental claims are nearly perfect recall on out-of-distribution benchmarks, low false-positive rates that preserve agent utility, robustness to both guard-targeted and adaptive attacks under distribution shifts, and efficient parallel execution with no added latency.

Significance. If the reported results hold under rigorous verification, this would be a meaningful practical contribution to securing autonomous web agents, directly addressing the four stated limitations of prior guards (generalization, false positives, latency, and adversarial vulnerability). The scale of the collected data and the explicit co-evolutionary training loop are strengths that could influence future defense design.

major comments (2)

[§3] §3 (Dataset construction): The OOD generalization and robustness claims rest on the assumption that the 719 high-traffic URLs in WARD-Base provide sufficient domain, structural, and linguistic diversity to support performance under 'substantial distribution shifts.' No quantitative analysis (e.g., domain entropy, language distribution, or structural feature coverage) is supplied to substantiate this coverage, leaving open the possibility that reported OOD results reflect in-distribution behavior rather than true generalization.
[§4.3, §5] §4.3 and §5 (Evaluation and A3T): The robustness claims against adaptive and guard-targeted attacks are supported by A3T co-evolution, yet the paper does not report whether the held-out OOD benchmarks and WARD-PIG test splits were generated independently of the co-evolution loop or whether any post-hoc selection of attack variants occurred. This detail is load-bearing for the claim that robustness extends 'outside the training loop.'

minor comments (2)

[Abstract] The abstract states 'nearly perfect recall' and 'low false positive rates' without numerical values or confidence intervals; these should be stated explicitly with the corresponding tables or figures in the main text.
[§4.2] Notation for the memory-based attacker components in the A3T description could be clarified with a small diagram or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [§3] §3 (Dataset construction): The OOD generalization and robustness claims rest on the assumption that the 719 high-traffic URLs in WARD-Base provide sufficient domain, structural, and linguistic diversity to support performance under 'substantial distribution shifts.' No quantitative analysis (e.g., domain entropy, language distribution, or structural feature coverage) is supplied to substantiate this coverage, leaving open the possibility that reported OOD results reflect in-distribution behavior rather than true generalization.

Authors: We agree that quantitative diversity metrics would strengthen the justification for OOD claims. The 719 URLs were selected from high-traffic sources across multiple platforms to capture varied web structures, but explicit entropy or coverage statistics were omitted. In revision we will add domain, language, and structural feature analysis to §3. revision: yes
Referee: [§4.3, §5] §4.3 and §5 (Evaluation and A3T): The robustness claims against adaptive and guard-targeted attacks are supported by A3T co-evolution, yet the paper does not report whether the held-out OOD benchmarks and WARD-PIG test splits were generated independently of the co-evolution loop or whether any post-hoc selection of attack variants occurred. This detail is load-bearing for the claim that robustness extends 'outside the training loop.'

Authors: The OOD benchmarks and WARD-PIG test splits were constructed and held out independently of the A3T training loop; the co-evolution operated exclusively on training data with no post-hoc variant selection on test sets. We will add explicit statements confirming this separation in the revised §4.3 and §5. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on experimental evaluation of independently collected datasets

full rationale

The paper describes an empirical guard model (WARD) trained on two custom datasets (WARD-Base from 719 URLs, WARD-PIG) plus an adaptive co-evolution training loop (A3T). All central claims—recall, false-positive rates, robustness to adaptive/guard-targeted attacks—are presented as outcomes of experimental benchmarks rather than any derivation, equation, or fitted quantity defined in terms of itself. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. Dataset coverage is an unverified modeling assumption but does not reduce any claimed result to its inputs by construction; the evaluation remains externally falsifiable on held-out benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on model parameters, training hyperparameters, or architectural assumptions; ledger left empty due to lack of information.

pith-pipeline@v0.9.1-grok · 5781 in / 1003 out tokens · 20866 ms · 2026-06-30T20:14:05.256277+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Same-Origin Policy for Agentic Browsers
cs.CR 2026-06 unverdicted novelty 7.0

The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.

Reference graph

Works this paper leans on

113 extracted references · 46 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Attacking multimodal os agents with malicious image patches.arXiv preprint arXiv:2503.10809, 2025

Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, and Adel Bibi. Attacking multimodal os agents with malicious image patches.arXiv preprint arXiv:2503.10809, 2025

work page arXiv 2025
[2]

Computer use

Anthropic. Computer use. https://docs.claude.com/en/docs/agents-and-tools/ tool-use/computer-use-tool, 2025. Accessed: 2025-09-24

2025
[3]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Phishagent: a robust multimodal agent for phishing webpage detection

Tri Cao, Chengyu Huang, Yuexin Li, Wang Huilin, Amy He, Nay Oo, and Bryan Hooi. Phishagent: a robust multimodal agent for phishing webpage detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27869–27877, 2025

2025
[6]

Phishagent: A robust multimodal agent for phishing webpage detection.arXiv preprint arXiv:2408.10738, 2024

Tri Cao, Chengyu Huang, Yuexin Li, Huilin Wang, Amy He, Nay Oo, and Bryan Hooi. Phishagent: A robust multimodal agent for phishing webpage detection.arXiv preprint arXiv:2408.10738, 2024

work page arXiv 2024
[7]

Vpi-bench: Visual prompt injection attacks for computer-use agents

Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, and Bryan Hooi. Vpi-bench: Visual prompt injection attacks for computer-use agents. arXiv preprint arXiv:2506.02456, 2025

work page arXiv 2025
[8]

WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He, Le Minh Khoi, Yangqiu Song, Shuicheng Yan, and Bryan Hooi. Webagentguard: A reasoning-driven guard model for detecting prompt injection attacks in web agents.arXiv preprint arXiv:2604.12284, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Topicattack: An indirect prompt injection attack via topic transition

Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, and Bryan Hooi. Topicattack: An indirect prompt injection attack via topic transition. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7338–7356, 2025

2025
[10]

Backdoor-powered prompt injection attacks nullify defense methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, and Bryan Hooi. Backdoor-powered prompt injection attacks nullify defense methods. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 4508–4527, 2025

2025
[11]

Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. Llama guard 3 vision: Safeguarding human-ai image understanding conversations.arXiv preprint arXiv:2411.10414, 2024

work page arXiv 2024
[12]

Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

2023
[13]

Browser-use agent documentation

Browser-Use Contributors. Browser-use agent documentation. https://docs.browser-use. com/introduction, 2025. Accessed: 2025-05-15

2025
[14]

A dataset of information-seeking questions and answers anchored in research papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021. 10

2021
[15]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

2024
[16]

Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

2025
[17]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 28091–28114. Curran Associates, Inc., 2023

2023
[18]

Eta: Evaluating then aligning safety of vision language models at inference time.arXiv preprint arXiv:2410.06625, 2024

Yi Ding, Bolian Li, and Ruqi Zhang. Eta: Evaluating then aligning safety of vision language models at inference time.arXiv preprint arXiv:2410.06625, 2024

work page arXiv 2024
[19]

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin, and Ee-Chien Chang. Snapguard: Lightweight prompt injection detection for screenshot-based web agents.arXiv preprint arXiv:2604.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaud- huri. Wasp: Benchmarking web agent security against prompt injection attacks.arXiv preprint arXiv:2504.18575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Imprompter: Tricking llm agents into improper tool use.arXiv preprint arXiv:2410.14923, 2024

Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking llm agents into improper tool use.arXiv preprint arXiv:2410.14923, 2024

work page arXiv 2024
[22]

PIArena: A Platform for Prompt Injection Evaluation

Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, and Jinyuan Jia. Piarena: A platform for prompt injection evaluation.arXiv preprint arXiv:2604.08499, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

A large- scale multi-document summarization dataset from the wikipedia current events portal

Demian Gholipour Ghalandari, Chris Hokamp, John Glover, Georgiana Ifrim, et al. A large- scale multi-document summarization dataset from the wikipedia current events portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1302–1308, 2020

2020
[24]

Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, and Amrit Singh Bedi. Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment. arXiv preprint arXiv:2411.18688, 2024

work page arXiv 2024
[25]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models.arXiv preprint arXiv:2302.12173, 27, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Longcoder: A long-range pre-trained language model for code completion

Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. InInternational Conference on Machine Learning, pages 12098–12107. PMLR, 2023

2023
[27]

Making every step effective: Jailbreaking large vision-language models through hierarchical kv equalization.arXiv preprint arXiv:2503.11750, 2025

Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, and Yujun Cai. Making every step effective: Jailbreaking large vision-language models through hierarchical kv equalization.arXiv preprint arXiv:2503.11750, 2025

work page arXiv 2025
[28]

Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models.arXiv preprint arXiv:2503.17682, 2025

Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models.arXiv preprint arXiv:2503.17682, 2025

work page arXiv 2025
[29]

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024. 11

work page arXiv 2024
[30]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

2020
[31]

An empirical survey on long document summarization: Datasets, models, and metrics.ACM computing surveys, 55(8):1–35, 2022

Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. An empirical survey on long document summarization: Datasets, models, and metrics.ACM computing surveys, 55(8):1–35, 2022

2022
[32]

Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019
[33]

{KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection

Yuexin Li, Chengyu Huang, Shumin Deng, Mei Lin Lock, Tri Cao, Nay Oo, Hoon Wei Lim, and Bryan Hooi. {KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection. In33rd USENIX Security Symposium (USENIX Security 24), pages 793–810, 2024

2024
[34]

Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

work page arXiv 2024
[35]

Eia: Environmental injection attack on generalist web agents for privacy leakage

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[36]

Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755, 2026

Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, and Enhong Chen. Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755, 2026

work page arXiv 2026
[37]

Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap.arXiv preprint arXiv:2502.10486, 2025

Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap.arXiv preprint arXiv:2502.10486, 2025

work page arXiv 2025
[38]

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via flipping.arXiv preprint arXiv:2410.02832, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.arXiv preprint arXiv:2505.11049, 2025

Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, et al. Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.arXiv preprint arXiv:2505.11049, 2025

work page arXiv 2025
[40]

Datasentinel: A game-theoretic detection of prompt injection attacks

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In2025 IEEE Symposium on Security and Privacy (SP), pages 2190–2208. IEEE, 2025

2025
[41]

Safety alignment for vision language models.arXiv preprint arXiv:2405.13581, 2024

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xi- aoyong Zhu, and Bo Zheng. Safety alignment for vision language models.arXiv preprint arXiv:2405.13581, 2024

work page arXiv 2024
[42]

Model card - prompt guard

Meta. Model card - prompt guard. https://huggingface.co/meta-llama/ Prompt-Guard-86M, 2024

2024
[43]

Llama guard 4 model card

Meta. Llama guard 4 model card. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama-guard-4/, 2025. Accessed: 2026-05-06

2025
[44]

Llama prompt guard 2 model card

Meta. Llama prompt guard 2 model card. https://huggingface.co/meta-llama/ Llama-Prompt-Guard-2-86M/, 2025

2025
[45]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, et al. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections. arXiv preprint arXiv:2510.09023, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

A survey of webagents: Towards next-generation ai agents for web automation with large foundation models

Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S Yu, et al. A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6140–6150, 2025

2025
[47]

Introducing gpt-oss-safeguard

OpenAI. Introducing gpt-oss-safeguard. https://openai.com/index/ introducing-gpt-oss-safeguard/, 2025. Accessed: 2026-05-06

2025
[48]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[49]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

2023
[52]

Know what you don’t know: Unanswerable ques- tions for squad

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable ques- tions for squad. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

2018
[53]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. InThe Twelfth International Conference on Learning Representations, 2023

2023
[55]

Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

work page arXiv 2025
[56]

Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku

Claude Team. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https://www.anthropic.com/news/3-5-models-and-computer-use, 2024

2024
[57]

Openagentsafety: Aframeworkforevaluatingreal-worldaiagentsafety.arXiv preprint arXiv:2507.06134, 2026

Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Gra- ham Neubig, and Maarten Sap. Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134, 2025

work page arXiv 2025
[58]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Manipulating multimodal agents via cross-modal prompt injection

Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Manipulating multimodal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10955–10964, 2025

2025
[60]

Inferaligner: Inference-time alignment for harmlessness through cross-model guidance.arXiv preprint arXiv:2401.11206, 2024

Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance.arXiv preprint arXiv:2401.11206, 2024

work page arXiv 2024
[61]

Ideator: Jailbreaking and benchmarking large vision-language models using themselves

Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, and Yu-Gang Jiang. Ideator: Jailbreaking and benchmarking large vision-language models using themselves. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8875–8884, 2025. 13

2025
[62]

Webinject: Prompt injection attack to web agents

Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, and Neil Zhenqiang Gong. Webinject: Prompt injection attack to web agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2010–2030, 2025

2025
[63]

Websentinel: Detecting and localizing prompt injection attacks for web agents.arXiv preprint arXiv:2602.03792, 2026

Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, and Neil Gong. Websentinel: Detecting and localizing prompt injection attacks for web agents.arXiv preprint arXiv:2602.03792, 2026

work page arXiv 2026
[64]

Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024

2024
[65]

Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning

Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, et al. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. InICML 2025 Workshop on Computer Use Agents, 2025

2025
[66]

Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training.arXiv preprint arXiv:2502.11455, 2025

Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, and Wenjie Wang. Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training.arXiv preprint arXiv:2502.11455, 2025

work page arXiv 2025
[67]

Delimiters won’t save you from prompt injection

Simon Willison. Delimiters won’t save you from prompt injection. https://simonwillison. net/2023/May/11/delimiters-wont-save-you, 2023

2023
[68]

Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814,

Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Adversarial attacks on multimodal agents.arXiv preprint arXiv:2406.12814, 2024

work page arXiv 2024
[69]

Wipi: A new web threat for llm-driven web agents.arXiv preprint arXiv:2402.16965, 2024

Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. Wipi: A new web threat for llm-driven web agents.arXiv preprint arXiv:2402.16965, 2024

work page arXiv 2024
[70]

Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

work page arXiv 2025
[71]

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.arXiv preprint arXiv:2406.09187, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[72]

Advweb: Controllable black-box attacks on vlm-powered web agents.arXiv preprint arXiv:2410.17401, 2024

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advweb: Controllable black-box attacks on vlm-powered web agents.arXiv preprint arXiv:2410.17401, 2024

work page arXiv 2024
[73]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[74]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

2018
[75]

Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

2022
[76]

A survey of safety on large vision-language models: Attacks, defenses and evaluations.arXiv preprint arXiv:2502.14881, 2025

Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. A survey of safety on large vision-language models: Attacks, defenses and evaluations.arXiv preprint arXiv:2502.14881, 2025

work page arXiv 2025
[77]

Superagent guard: Frontier guardrails for ai agents

Alan Zabihi. Superagent guard: Frontier guardrails for ai agents. https://www.superagent. sh/blog/superagent-guard-frontier-guardrails-for-ai-agents, 2026

2026
[78]

Browsesafe: Understanding and preventing prompt injection within ai browser agents.arXiv preprint arXiv:2511.20597, 2025

Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, and Ninghui Li. Browsesafe: Understanding and preventing prompt injection within ai browser agents.arXiv preprint arXiv:2511.20597, 2025. 14

work page arXiv 2025
[79]

Mir- rorguard: Toward secure computer-use agents via simulation-to-real reasoning correction.arXiv preprint arXiv:2601.12822, 2026

Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, and Xudong Pan. Mir- rorguard: Toward secure computer-use agents via simulation-to-real reasoning correction.arXiv preprint arXiv:2601.12822, 2026

work page arXiv 2026
[80]

Attacking vision-language computer agents via pop-ups

Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. arXiv preprint arXiv:2411.02391, 2024

work page arXiv 2024

Showing first 80 references.

[1] [1]

Attacking multimodal os agents with malicious image patches.arXiv preprint arXiv:2503.10809, 2025

Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, and Adel Bibi. Attacking multimodal os agents with malicious image patches.arXiv preprint arXiv:2503.10809, 2025

work page arXiv 2025

[2] [2]

Computer use

Anthropic. Computer use. https://docs.claude.com/en/docs/agents-and-tools/ tool-use/computer-use-tool, 2025. Accessed: 2025-09-24

2025

[3] [3]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Phishagent: a robust multimodal agent for phishing webpage detection

Tri Cao, Chengyu Huang, Yuexin Li, Wang Huilin, Amy He, Nay Oo, and Bryan Hooi. Phishagent: a robust multimodal agent for phishing webpage detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27869–27877, 2025

2025

[6] [6]

Phishagent: A robust multimodal agent for phishing webpage detection.arXiv preprint arXiv:2408.10738, 2024

Tri Cao, Chengyu Huang, Yuexin Li, Huilin Wang, Amy He, Nay Oo, and Bryan Hooi. Phishagent: A robust multimodal agent for phishing webpage detection.arXiv preprint arXiv:2408.10738, 2024

work page arXiv 2024

[7] [7]

Vpi-bench: Visual prompt injection attacks for computer-use agents

Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, and Bryan Hooi. Vpi-bench: Visual prompt injection attacks for computer-use agents. arXiv preprint arXiv:2506.02456, 2025

work page arXiv 2025

[8] [8]

WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He, Le Minh Khoi, Yangqiu Song, Shuicheng Yan, and Bryan Hooi. Webagentguard: A reasoning-driven guard model for detecting prompt injection attacks in web agents.arXiv preprint arXiv:2604.12284, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Topicattack: An indirect prompt injection attack via topic transition

Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, and Bryan Hooi. Topicattack: An indirect prompt injection attack via topic transition. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7338–7356, 2025

2025

[10] [10]

Backdoor-powered prompt injection attacks nullify defense methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, and Bryan Hooi. Backdoor-powered prompt injection attacks nullify defense methods. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 4508–4527, 2025

2025

[11] [11]

Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. Llama guard 3 vision: Safeguarding human-ai image understanding conversations.arXiv preprint arXiv:2411.10414, 2024

work page arXiv 2024

[12] [12]

Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

2023

[13] [13]

Browser-use agent documentation

Browser-Use Contributors. Browser-use agent documentation. https://docs.browser-use. com/introduction, 2025. Accessed: 2025-05-15

2025

[14] [14]

A dataset of information-seeking questions and answers anchored in research papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021. 10

2021

[15] [15]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

2024

[16] [16]

Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

2025

[17] [17]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 28091–28114. Curran Associates, Inc., 2023

2023

[18] [18]

Eta: Evaluating then aligning safety of vision language models at inference time.arXiv preprint arXiv:2410.06625, 2024

Yi Ding, Bolian Li, and Ruqi Zhang. Eta: Evaluating then aligning safety of vision language models at inference time.arXiv preprint arXiv:2410.06625, 2024

work page arXiv 2024

[19] [19]

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin, and Ee-Chien Chang. Snapguard: Lightweight prompt injection detection for screenshot-based web agents.arXiv preprint arXiv:2604.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaud- huri. Wasp: Benchmarking web agent security against prompt injection attacks.arXiv preprint arXiv:2504.18575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Imprompter: Tricking llm agents into improper tool use.arXiv preprint arXiv:2410.14923, 2024

Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking llm agents into improper tool use.arXiv preprint arXiv:2410.14923, 2024

work page arXiv 2024

[22] [22]

PIArena: A Platform for Prompt Injection Evaluation

Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, and Jinyuan Jia. Piarena: A platform for prompt injection evaluation.arXiv preprint arXiv:2604.08499, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

A large- scale multi-document summarization dataset from the wikipedia current events portal

Demian Gholipour Ghalandari, Chris Hokamp, John Glover, Georgiana Ifrim, et al. A large- scale multi-document summarization dataset from the wikipedia current events portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1302–1308, 2020

2020

[24] [24]

Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, and Amrit Singh Bedi. Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment. arXiv preprint arXiv:2411.18688, 2024

work page arXiv 2024

[25] [25]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models.arXiv preprint arXiv:2302.12173, 27, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

Longcoder: A long-range pre-trained language model for code completion

Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. InInternational Conference on Machine Learning, pages 12098–12107. PMLR, 2023

2023

[27] [27]

Making every step effective: Jailbreaking large vision-language models through hierarchical kv equalization.arXiv preprint arXiv:2503.11750, 2025

Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, and Yujun Cai. Making every step effective: Jailbreaking large vision-language models through hierarchical kv equalization.arXiv preprint arXiv:2503.11750, 2025

work page arXiv 2025

[28] [28]

Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models.arXiv preprint arXiv:2503.17682, 2025

Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models.arXiv preprint arXiv:2503.17682, 2025

work page arXiv 2025

[29] [29]

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024. 11

work page arXiv 2024

[30] [30]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

2020

[31] [31]

An empirical survey on long document summarization: Datasets, models, and metrics.ACM computing surveys, 55(8):1–35, 2022

Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. An empirical survey on long document summarization: Datasets, models, and metrics.ACM computing surveys, 55(8):1–35, 2022

2022

[32] [32]

Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019

[33] [33]

{KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection

Yuexin Li, Chengyu Huang, Shumin Deng, Mei Lin Lock, Tri Cao, Nay Oo, Hoon Wei Lim, and Bryan Hooi. {KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection. In33rd USENIX Security Symposium (USENIX Security 24), pages 793–810, 2024

2024

[34] [34]

Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

work page arXiv 2024

[35] [35]

Eia: Environmental injection attack on generalist web agents for privacy leakage

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[36] [36]

Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755, 2026

Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, and Enhong Chen. Vigil: Defending llm agents against tool stream injection via verify-before-commit.arXiv preprint arXiv:2601.05755, 2026

work page arXiv 2026

[37] [37]

Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap.arXiv preprint arXiv:2502.10486, 2025

Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap.arXiv preprint arXiv:2502.10486, 2025

work page arXiv 2025

[38] [38]

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via flipping.arXiv preprint arXiv:2410.02832, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.arXiv preprint arXiv:2505.11049, 2025

Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, et al. Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.arXiv preprint arXiv:2505.11049, 2025

work page arXiv 2025

[40] [40]

Datasentinel: A game-theoretic detection of prompt injection attacks

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In2025 IEEE Symposium on Security and Privacy (SP), pages 2190–2208. IEEE, 2025

2025

[41] [41]

Safety alignment for vision language models.arXiv preprint arXiv:2405.13581, 2024

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xi- aoyong Zhu, and Bo Zheng. Safety alignment for vision language models.arXiv preprint arXiv:2405.13581, 2024

work page arXiv 2024

[42] [42]

Model card - prompt guard

Meta. Model card - prompt guard. https://huggingface.co/meta-llama/ Prompt-Guard-86M, 2024

2024

[43] [43]

Llama guard 4 model card

Meta. Llama guard 4 model card. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama-guard-4/, 2025. Accessed: 2026-05-06

2025

[44] [44]

Llama prompt guard 2 model card

Meta. Llama prompt guard 2 model card. https://huggingface.co/meta-llama/ Llama-Prompt-Guard-2-86M/, 2025

2025

[45] [45]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, et al. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections. arXiv preprint arXiv:2510.09023, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

A survey of webagents: Towards next-generation ai agents for web automation with large foundation models

Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S Yu, et al. A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6140–6150, 2025

2025

[47] [47]

Introducing gpt-oss-safeguard

OpenAI. Introducing gpt-oss-safeguard. https://openai.com/index/ introducing-gpt-oss-safeguard/, 2025. Accessed: 2026-05-06

2025

[48] [48]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[49] [49]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[50] [50]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

2023

[52] [52]

Know what you don’t know: Unanswerable ques- tions for squad

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable ques- tions for squad. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

2018

[53] [53]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. InThe Twelfth International Conference on Learning Representations, 2023

2023

[55] [55]

Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

work page arXiv 2025

[56] [56]

Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku

Claude Team. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https://www.anthropic.com/news/3-5-models-and-computer-use, 2024

2024

[57] [57]

Openagentsafety: Aframeworkforevaluatingreal-worldaiagentsafety.arXiv preprint arXiv:2507.06134, 2026

Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Gra- ham Neubig, and Maarten Sap. Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134, 2025

work page arXiv 2025

[58] [58]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Manipulating multimodal agents via cross-modal prompt injection

Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Manipulating multimodal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10955–10964, 2025

2025

[60] [60]

Inferaligner: Inference-time alignment for harmlessness through cross-model guidance.arXiv preprint arXiv:2401.11206, 2024

Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance.arXiv preprint arXiv:2401.11206, 2024

work page arXiv 2024

[61] [61]

Ideator: Jailbreaking and benchmarking large vision-language models using themselves

Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, and Yu-Gang Jiang. Ideator: Jailbreaking and benchmarking large vision-language models using themselves. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8875–8884, 2025. 13

2025

[62] [62]

Webinject: Prompt injection attack to web agents

Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, and Neil Zhenqiang Gong. Webinject: Prompt injection attack to web agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2010–2030, 2025

2025

[63] [63]

Websentinel: Detecting and localizing prompt injection attacks for web agents.arXiv preprint arXiv:2602.03792, 2026

Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, and Neil Gong. Websentinel: Detecting and localizing prompt injection attacks for web agents.arXiv preprint arXiv:2602.03792, 2026

work page arXiv 2026

[64] [64]

Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?Advances in Neural Information Processing Systems, 36, 2024

2024

[65] [65]

Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning

Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, et al. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. InICML 2025 Workshop on Computer Use Agents, 2025

2025

[66] [66]

Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training.arXiv preprint arXiv:2502.11455, 2025

Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, and Wenjie Wang. Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training.arXiv preprint arXiv:2502.11455, 2025

work page arXiv 2025

[67] [67]

Delimiters won’t save you from prompt injection

Simon Willison. Delimiters won’t save you from prompt injection. https://simonwillison. net/2023/May/11/delimiters-wont-save-you, 2023

2023

[68] [68]

Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814,

Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Adversarial attacks on multimodal agents.arXiv preprint arXiv:2406.12814, 2024

work page arXiv 2024

[69] [69]

Wipi: A new web threat for llm-driven web agents.arXiv preprint arXiv:2402.16965, 2024

Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. Wipi: A new web threat for llm-driven web agents.arXiv preprint arXiv:2402.16965, 2024

work page arXiv 2024

[70] [70]

Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

work page arXiv 2025

[71] [71]

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.arXiv preprint arXiv:2406.09187, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [72]

Advweb: Controllable black-box attacks on vlm-powered web agents.arXiv preprint arXiv:2410.17401, 2024

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advweb: Controllable black-box attacks on vlm-powered web agents.arXiv preprint arXiv:2410.17401, 2024

work page arXiv 2024

[73] [73]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[74] [74]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

2018

[75] [75]

Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

2022

[76] [76]

A survey of safety on large vision-language models: Attacks, defenses and evaluations.arXiv preprint arXiv:2502.14881, 2025

Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. A survey of safety on large vision-language models: Attacks, defenses and evaluations.arXiv preprint arXiv:2502.14881, 2025

work page arXiv 2025

[77] [77]

Superagent guard: Frontier guardrails for ai agents

Alan Zabihi. Superagent guard: Frontier guardrails for ai agents. https://www.superagent. sh/blog/superagent-guard-frontier-guardrails-for-ai-agents, 2026

2026

[78] [78]

Browsesafe: Understanding and preventing prompt injection within ai browser agents.arXiv preprint arXiv:2511.20597, 2025

Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, and Ninghui Li. Browsesafe: Understanding and preventing prompt injection within ai browser agents.arXiv preprint arXiv:2511.20597, 2025. 14

work page arXiv 2025

[79] [79]

Mir- rorguard: Toward secure computer-use agents via simulation-to-real reasoning correction.arXiv preprint arXiv:2601.12822, 2026

Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, and Xudong Pan. Mir- rorguard: Toward secure computer-use agents via simulation-to-real reasoning correction.arXiv preprint arXiv:2601.12822, 2026

work page arXiv 2026

[80] [80]

Attacking vision-language computer agents via pop-ups

Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. arXiv preprint arXiv:2411.02391, 2024

work page arXiv 2024