SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
Pith reviewed 2026-05-18 08:12 UTC · model grok-4.3
The pith
A new benchmark shows that all tested vision-language web agents remain vulnerable to subtle adversarial manipulations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SecureWebArena supplies the first unified suite for security testing of LVLM-based web agents. It combines six simulated environments, 2,970 trajectories, a taxonomy of six attack vectors covering both user-level and environment-level manipulations, and a three-part evaluation protocol that separately examines internal reasoning, full behavioral trajectories, and final task outcomes. Applied to nine models spanning general-purpose, agent-specialized, and GUI-grounded categories, the suite shows that every tested agent fails under subtle adversarial inputs and that model specialization introduces measurable security trade-offs.
What carries the argument
The unified evaluation suite that combines six realistic web environments with a structured taxonomy of six attack vectors and a multi-layered protocol for scoring failures in reasoning, trajectory, and outcome.
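To make the multi-layered scoring idea concrete, the following is a minimal Python sketch of how per-trajectory verdicts at the three layers (reasoning, trajectory, outcome) might be recorded and aggregated. The class name, field names, and the `summarize` helper are hypothetical and chosen for illustration; the paper's actual scoring rubric is not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass

LAYERS = ("reasoning", "trajectory", "outcome")


@dataclass(frozen=True)
class TrajectoryVerdict:
    """Hypothetical per-trajectory record; field names are illustrative only."""
    reasoning_compromised: bool  # internal reasoning adopted the injected goal
    trajectory_deviated: bool    # at least one action left the intended path
    outcome_compromised: bool    # final state matches the attacker's objective

    def failure_layers(self) -> tuple:
        """Return the names of the layers at which this trajectory failed."""
        flags = (
            self.reasoning_compromised,
            self.trajectory_deviated,
            self.outcome_compromised,
        )
        return tuple(name for name, failed in zip(LAYERS, flags) if failed)


def summarize(verdicts):
    """Fraction of trajectories failing at each layer, plus at any layer."""
    total = len(verdicts)
    if total == 0:
        raise ValueError("need at least one verdict")
    counts = Counter(layer for v in verdicts for layer in v.failure_layers())
    rates = {layer: counts[layer] / total for layer in LAYERS}
    rates["any"] = sum(1 for v in verdicts if v.failure_layers()) / total
    return rates
```

In a sketch like this, the gap between `rates["outcome"]` and `rates["any"]` is the share of compromised trajectories that an outcome-only metric would miss, which is the motivation the review attributes to the three-part protocol.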
If this is right
- Web-agent designs must add explicit protections against both prompt changes and site-level manipulations to reach reliable real-world use.
- Agent-specialized models require targeted security training to offset the vulnerabilities that accompany their performance gains.
- Security checks for these agents should routinely inspect reasoning steps and action sequences rather than measuring only final task success.
- The benchmark supplies a reusable foundation that future work can extend to develop agents suitable for trustworthy online automation.
Where Pith is reading between the lines
- Real-world deployment on live websites may surface additional attack surfaces not captured inside the simulated environments.
- Combining general and specialized models in a single system could reduce the security trade-offs seen when using either type alone.
- Continuous monitoring of agent reasoning during operation might allow early detection of the subtle manipulations the benchmark identifies.
Load-bearing premise
The six simulated environments and the six attack categories are broad enough to represent the security threats that would appear when these agents run on actual live websites.
What would settle it
Running one of the nine tested agents on a real e-commerce or forum site and successfully triggering one of the benchmark attack vectors to produce the predicted failure mode would support the results; repeated inability to reproduce the same failures on live sites would weaken the claim that the benchmark captures meaningful risks.
Original abstract
Large vision-language model (LVLM)-based web agents are emerging as powerful tools for automating complex online tasks. However, when deployed in real-world environments, they face serious security risks, motivating the design of security evaluation benchmarks. Existing benchmarks provide only partial coverage, typically restricted to narrow scenarios such as user-level prompt manipulation, and thus fail to capture the broad range of agent vulnerabilities. To address this gap, we present SecureWebArena, the first holistic benchmark for evaluating the security of LVLM-based web agents. SecureWebArena first introduces a unified evaluation suite comprising six simulated but realistic web environments (e.g., e-commerce platforms, community forums) and includes 2,970 high-quality trajectories spanning diverse tasks and attack settings. The suite defines a structured taxonomy of six attack vectors spanning both user-level and environment-level manipulations. In addition, we introduce a multi-layered evaluation protocol that analyzes agent failures across three critical dimensions: internal reasoning, behavioral trajectory, and task outcome, facilitating a fine-grained risk analysis that goes far beyond simple success metrics. Using this benchmark, we conduct large-scale experiments on 9 representative LVLMs, which fall into three categories: general-purpose, agent-specialized, and GUI-grounded. Our results show that all tested agents are consistently vulnerable to subtle adversarial manipulations and reveal critical trade-offs between model specialization and security. By providing (1) a comprehensive benchmark suite with diverse environments and a multi-layered evaluation pipeline, and (2) empirical insights into the security challenges of modern LVLM-based web agents, SecureWebArena establishes a foundation for advancing trustworthy web agent deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SecureWebArena as the first holistic security benchmark for LVLM-based web agents. It comprises six simulated realistic web environments (e.g., e-commerce and forums), 2,970 trajectories, a taxonomy of six attack vectors covering user- and environment-level manipulations, and a multi-layered evaluation protocol assessing internal reasoning, behavioral trajectories, and task outcomes. Large-scale experiments on nine LVLMs across general-purpose, agent-specialized, and GUI-grounded categories show that all agents are vulnerable to subtle adversarial manipulations, with observed trade-offs between model specialization and security.
Significance. If the simulation fidelity and evaluation protocol hold, this benchmark fills the gap left by existing evaluations, which offer only partial coverage, by providing comprehensive coverage and fine-grained failure analysis, and it establishes a foundation for trustworthy LVLM web agent deployment. The scale (nine models, nearly three thousand trajectories) and the multi-dimensional protocol are strengths that enable more nuanced risk assessment than success-rate-only metrics.
major comments (1)
- [Abstract and §3 (Benchmark Construction)] The central claim that all tested agents are consistently vulnerable and that the benchmark captures broad real-world threats rests on the six simulated environments and six-vector taxonomy. However, the manuscript provides no explicit validation (e.g., comparison of observable states, DOM dynamics, or authentication redirects) that these simulations reproduce the security-relevant behaviors of live web sites; without such evidence the reported failure rates and specialization-security trade-offs may not transfer.
minor comments (2)
- [Abstract] The abstract summarizes high-level outcomes but supplies no quantitative results, error bars, or statistical controls; moving at least one key table or figure summary into the abstract would strengthen immediate verifiability.
- [§4 (Experiments)] Clarify whether the 2,970 trajectories include balanced coverage across all six attack vectors and environments, or whether some combinations are underrepresented.
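The coverage question in the second minor comment can be checked mechanically once each trajectory is labeled. The sketch below is an illustration under stated assumptions: the function name `coverage_report` and the label format (one `(environment, attack_vector)` pair per trajectory) are hypothetical, and the benchmark's real metadata schema may differ.

```python
from collections import Counter
from itertools import product


def coverage_report(labels, environments, attack_vectors):
    """Count trajectories per (environment, attack_vector) cell.

    `labels` is an iterable of (environment, attack_vector) pairs, one per
    trajectory. Pairs outside the given grid are ignored in this sketch.
    """
    counts = Counter(labels)
    per_cell = {
        cell: counts.get(cell, 0)
        for cell in product(environments, attack_vectors)
    }
    smallest = min(per_cell.values())
    largest = max(per_cell.values())
    return {
        "per_cell": per_cell,
        "missing": [cell for cell, n in per_cell.items() if n == 0],
        "min": smallest,
        "max": largest,
        # Ratio of the fullest to the emptiest cell; infinite if a cell is empty.
        "imbalance_ratio": largest / smallest if smallest else float("inf"),
    }


# Arithmetic check only: 2,970 is not a multiple of 6 * 6 = 36.
UNIFORM_SPLIT = divmod(2970, 6 * 6)  # (82, 18)
```

If every trajectory belonged to exactly one of the 36 environment and attack-vector cells, a perfectly uniform split would be impossible, since 2,970 leaves a remainder of 18 when divided by 36. That assumption may not hold (for example, the total could include benign baselines or count runs per model), which is why the referee's request for an explicit breakdown is reasonable.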
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of simulation fidelity in establishing the transferability of our benchmark results. We agree that explicit validation would strengthen the central claims regarding agent vulnerabilities and the broad applicability of the observed trade-offs. We address this comment below and will incorporate revisions to improve the manuscript.
- Referee: [Abstract and §3 (Benchmark Construction)] The central claim that all tested agents are consistently vulnerable and that the benchmark captures broad real-world threats rests on the six simulated environments and six-vector taxonomy. However, the manuscript provides no explicit validation (e.g., comparison of observable states, DOM dynamics, or authentication redirects) that these simulations reproduce the security-relevant behaviors of live web sites; without such evidence the reported failure rates and specialization-security trade-offs may not transfer.
  Authors: We acknowledge that the current manuscript does not present a direct side-by-side empirical validation (such as quantitative comparisons of DOM trees, state transitions, or authentication redirect behaviors) between the simulated environments and live websites. The environments were implemented using standard web frameworks to replicate core interaction patterns and security surfaces observed in real platforms (e.g., dynamic product updates in e-commerce and threaded discussions in forums). However, we recognize that this design rationale alone is insufficient to fully address transferability concerns. In the revised manuscript, we will expand §3 with a dedicated subsection on environment construction that includes concrete examples of replicated observable states and attack surfaces. We will also add an explicit limitations paragraph discussing potential discrepancies with live deployments and how future work could incorporate real-site validation. These changes will allow readers to better evaluate the generalizability of the reported failure rates and specialization-security trade-offs.
  Revision: yes
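One simple form the requested side-by-side comparison could take is a structural similarity score between a saved snapshot of a simulated page and a saved snapshot of a comparable live page. The sketch below uses only the Python standard library; the class `TagProfile` and the functions `profile` and `tag_overlap` are hypothetical names, and the weighted Jaccard score is a stand-in chosen for illustration, not a method described by the authors or the referee.

```python
from collections import Counter
from html.parser import HTMLParser


class TagProfile(HTMLParser):
    """Collect tag counts and form targets from one static HTML snapshot."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "form":
            # Form targets are one security-relevant surface worth comparing.
            self.form_actions.append(dict(attrs).get("action", ""))


def profile(html):
    """Parse an HTML string and return the populated TagProfile."""
    parser = TagProfile()
    parser.feed(html)
    parser.close()
    return parser


def tag_overlap(html_a, html_b):
    """Weighted Jaccard similarity of tag counts: sum(min) / sum(max)."""
    a, b = profile(html_a).tags, profile(html_b).tags
    keys = set(a) | set(b)
    denominator = sum(max(a[k], b[k]) for k in keys)
    if denominator == 0:
        return 1.0  # two tag-free documents are treated as identical
    return sum(min(a[k], b[k]) for k in keys) / denominator
```

A score like this only compares static structure. It says nothing about DOM dynamics, state transitions, or authentication redirects, which would need instrumented browsing sessions rather than snapshots.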
Circularity Check
Empirical benchmark creation and testing exhibits no circularity
This paper introduces a new security evaluation benchmark for LVLM-based web agents, defines six simulated environments and a taxonomy of six attack vectors, generates 2,970 trajectories, and reports empirical results from testing nine existing models across three categories. The central claims consist of observed failure rates and specialization-security trade-offs derived directly from these experiments. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text; the work is self-contained as an empirical evaluation effort without any reduction of results to self-referential inputs or self-citation chains.
Axiom & Free-Parameter Ledger
free parameters (2)
- Selection and design of the six web environments
- Definition of the six attack vectors in the taxonomy
axioms (1)
- Domain assumption: Simulated web environments can serve as valid proxies for evaluating real-world security risks to deployed LVLM-based agents.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: SecureWebArena first introduces a unified evaluation suite comprising six simulated but realistic web environments ... and a structured taxonomy of six attack vectors spanning both user-level and environment-level manipulations.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.