WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
Pith reviewed 2026-05-10 18:44 UTC · model grok-4.3
The pith
Current web agents fail more than 45 percent of the time on security and privacy tasks that involve stateful UI elements such as toggles and checkboxes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebSP-Eval shows that state-of-the-art multimodal agents exhibit limited autonomous exploration when executing website security and privacy tasks, performing poorly on specific task categories and websites. Stateful UI elements such as toggles and checkboxes emerge as the dominant failure mode, with failure rates exceeding 45 percent across many models.
What carries the argument
The WebSP-Eval framework: a 200-task dataset, a Chrome extension for consistent account and state initialization, and an automated evaluator. Together, these components isolate performance drops tied to stateful UI components.
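In outline, the framework pairs a per-task state reset with an automated check of the final page state. A minimal sketch of such a loop in Python (all names here — `Task`, `run_benchmark`, the callback roles — are hypothetical illustrations, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    website: str
    ui_elements: list        # e.g. ["toggle", "checkbox"]
    expected_state: dict     # ground-truth final settings

def run_benchmark(tasks, agent, reset_state, evaluate):
    """Run every task from an identical initial state and record pass/fail.

    reset_state: restores cookies/storage/account state (the extension's role)
    agent:       drives the browser and returns the final observed settings
    evaluate:    compares the final state against the ground truth
    """
    results = {}
    for task in tasks:
        reset_state(task.website)
        final_state = agent(task)
        results[task.task_id] = evaluate(final_state, task.expected_state)
    return results
```

The design point the framework banks on is that `reset_state` makes runs comparable: any two agents see byte-identical starting conditions, so score differences can be attributed to the agents rather than leftover state.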
If this is right
- Developers of web agents must prioritize better handling of dynamic, state-dependent controls to raise success rates on privacy tasks.
- Future benchmarks for web agents should include dedicated security and privacy task suites to expose these weaknesses systematically.
- Performance gaps across websites indicate that agent training or prompting needs site-specific adaptation rather than generic approaches.
- The state-management extension enables repeatable evaluation, allowing direct comparison of future agent improvements on the same tasks.
Where Pith is reading between the lines
- If stateful elements are the main bottleneck, training corpora for agents could be enriched with many more examples of checkbox and toggle interactions inside privacy flows.
- The observed exploration limits may point to a wider difficulty for agents in maintaining context across multi-step, state-changing web sessions beyond security tasks.
- Widespread adoption of such agents without fixes could inadvertently reduce user control over personal data settings on popular sites.
Load-bearing premise
The 200 manually written tasks across 28 websites represent the actual diversity and frequency of real-world user-facing security and privacy interactions, and the custom extension maintains identical starting states without introducing artifacts.
What would settle it
Re-running the same agents on a fresh collection of tasks that deliberately varies the proportion and types of stateful UI elements and websites, then measuring whether the failure rate on those elements drops below 45 percent or stays stable.
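One way to frame that check statistically: given per-element-type counts of failures and trials, a simple confidence interval indicates whether a measured rate above 45 percent is robust or a small-sample artifact. A sketch with hypothetical counts (the 58/100 figures below are illustrative, not from the paper):

```python
import math

def failure_rate_ci(failures, trials, z=1.96):
    """Failure-rate point estimate with a normal-approximation 95% CI."""
    p = failures / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical counts for tasks containing stateful elements (not the paper's data):
p, low, high = failure_rate_ci(failures=58, trials=100)
# The ">45%" claim is robust only if the whole interval sits above 0.45.
stateful_bottleneck = low > 0.45
```

Running the proposed fresh task collection through the same computation, per element type, would show directly whether the stateful-element interval separates from the intervals for other element types.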
read the original abstract
Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models suffer from limited autonomous exploration capabilities to reliably solve website security and privacy tasks, and struggle with specific task categories and websites. Crucially, we identify stateful UI elements such as toggles and checkboxes as a primary reason for agent failure, failing at a rate of more than 45% in tasks containing these elements across many models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebSP-Eval, a benchmark framework for evaluating web agents on user-facing website security and privacy tasks such as cookie management, privacy settings, and session revocation. It consists of a manually curated dataset of 200 task instances spanning 28 websites, a custom Chrome extension for consistent account and initial-state management, and an automated evaluator. The authors evaluate eight agent instantiations based on state-of-the-art multimodal LLMs, performing fine-grained analysis by website, task category, and UI element type. Key findings include limited autonomous exploration capabilities overall and a failure rate exceeding 45% on tasks involving stateful UI elements such as toggles and checkboxes.
Significance. If the empirical results hold under rigorous validation, the work provides a timely benchmark that highlights a previously under-examined weakness in web agents: reliable handling of interactive, state-dependent security and privacy interfaces. The identification of stateful UI elements as a dominant failure mode offers a concrete, actionable direction for agent improvement. The framework's support for reproducible state management and automated evaluation is a practical contribution that could be adopted by the community. The fine-grained breakdown across categories strengthens the diagnostic value beyond aggregate success rates.
major comments (2)
- §3 (Task Dataset): The construction of the 200 manually crafted tasks is presented at a high level, without reported validation steps such as human solvability checks, inter-annotator agreement, or pilot runs to confirm that each task has a well-defined, achievable ground-truth outcome. Because the central claims rest on measured failure rates (including the >45% rate for stateful elements), the absence of such validation leaves open the possibility that task formulation itself contributes to the observed difficulties.
- §4.1 (Agentic System and Chrome Extension): The custom extension is described as ensuring consistent initial states across runs, yet no quantitative evaluation of its reliability (e.g., reset success rate, comparison against manual browser resets, or measurement of residual state leakage) is provided. This is load-bearing for the reproducibility of the reported performance numbers and for attributing failures specifically to agent limitations rather than evaluation artifacts.
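For the inter-annotator agreement the first comment asks about, a standard report would be Cohen's kappa over binary per-task solvability judgments. A self-contained sketch (the labels below are hypothetical, not data from the paper):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary labels (e.g. 'task solvable?')."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal frequency of "1" labels
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Hypothetical solvability judgments for ten candidate tasks
kappa = cohens_kappa([1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
                     [1, 1, 1, 0, 1, 0, 0, 1, 1, 1])
```

Reporting kappa alongside raw agreement would let readers judge whether the curated tasks have well-defined outcomes independent of any single annotator.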
minor comments (2)
- Abstract: The claim of a 'fine-grained analysis' would be clearer if the abstract briefly named the success metric (e.g., task completion rate) and the exact method used to attribute failures to stateful UI elements.
- Related Work: The discussion of prior benchmarks (WebArena, SafeArena) could more explicitly contrast the new tasks' focus on legitimate user security/privacy actions versus the safety-against-malicious-action emphasis of existing suites.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential impact of WebSP-Eval. We address each major comment point by point below, with clear indications of planned revisions to strengthen the manuscript's rigor and reproducibility.
read point-by-point responses
-
Referee: §3 (Task Dataset): The construction of the 200 manually crafted tasks is presented at a high level without reported validation steps such as human solvability checks, inter-annotator agreement, or pilot runs to confirm that each task has a well-defined, achievable ground-truth outcome. Because the central claims rest on measured failure rates (including the >45% rate for stateful elements), the absence of such validation leaves open the possibility that task formulation itself contributes to the observed difficulties.
Authors: We agree that additional details on task construction would improve transparency and help readers assess whether formulation contributes to observed failures. Each task was manually designed by the authors with explicit ground-truth action sequences derived from official website documentation and direct UI inspection to ensure a unique, verifiable outcome. Internal pilot testing was performed on a subset of tasks to confirm achievability before full-scale evaluation. We did not conduct formal multi-annotator agreement studies because curation was performed by a small expert team with iterative consensus. In the revised manuscript, we will expand §3 with a dedicated subsection describing the task creation methodology, including concrete examples of task definitions, ground-truth determination, and summary statistics from our internal pilots. This will allow readers to better evaluate the dataset's quality without altering the core results. revision: partial
-
Referee: §4.1 (Agentic System and Chrome Extension): The custom extension is described as ensuring consistent initial states across runs, yet no quantitative evaluation of its reliability (e.g., reset success rate, comparison against manual browser resets, or measurement of residual state leakage) is provided. This is load-bearing for the reproducibility of the reported performance numbers and for attributing failures specifically to agent limitations rather than evaluation artifacts.
Authors: We acknowledge that quantitative reliability metrics for the extension were not reported, which limits the ability to fully attribute failures to agent capabilities. The extension was implemented to handle deterministic state resets (clearing cookies, local storage, and session data) and account management, and our experimental runs showed consistent behavior with no observed state leakage affecting results. However, we did not include formal measurements such as reset success rates or comparisons to manual resets. In the revised version, we will add a new subsection (or appendix) to §4.1 that provides a more detailed technical description of the extension's architecture and reports any internal reliability checks performed during development. We are also prepared to conduct a small-scale quantitative validation experiment (e.g., measuring reset success over repeated trials) if the referee considers it essential for acceptance. revision: partial
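The small-scale validation proposed here could be as simple as repeated reset-and-snapshot trials. A hedged sketch, where `reset`, `read_state`, and `clean_state` stand in for the extension's actual hooks (hypothetical names, not the authors' code):

```python
def measure_reset_reliability(reset, read_state, clean_state, trials=50):
    """Repeatedly reset the browser state and check for residual leakage.

    reset:       performs the extension-style reset (cookies, storage, session)
    read_state:  snapshots the observable state after the reset
    clean_state: the expected baseline snapshot
    Returns (success_rate, leaks), where each leak records the trial index
    and the keys whose values differed from the baseline.
    """
    successes, leaks = 0, []
    for i in range(trials):
        reset()
        state = read_state()
        if state == clean_state:
            successes += 1
        else:
            diff = {k: v for k, v in state.items() if clean_state.get(k) != v}
            leaks.append((i, diff))
    return successes / trials, leaks
```

Reporting the resulting success rate (and any leaked keys) would directly address the referee's concern about separating evaluation artifacts from genuine agent failures.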
Circularity Check
No significant circularity
full rationale
The paper introduces an empirical evaluation framework (WebSP-Eval) consisting of a manually crafted dataset of 200 task instances across 28 websites, a custom Chrome extension for state management, and an automated evaluator. All reported results, including the >45% failure rate on stateful UI elements such as toggles and checkboxes, are direct measurements obtained by executing 8 agent instantiations on these tasks. No mathematical derivations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on independent empirical observations rather than any reduction to the paper's own inputs or self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The manually crafted 200 tasks across 28 websites represent typical real-world user security and privacy interactions.
Forward citations
Cited by 1 Pith paper
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
Reference graph
Works this paper leans on
- [1] Anthropic. Introducing Claude Sonnet 4.5. https://www-cdn.anthropic.com/963373e433e489a87a10c823c52a0a013e9172dd.pdf, September 2025. Released September 29, 2025. Accessed: 02-03-2026.
- [2] Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, February.
- [4] Shaon Barman, Sarah Chasins, Rastislav Bodik, and Sumit Gulwani. Ringer: web automation by demonstration. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 748–764, 2016.
- [5] Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault L De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks. Advances in Neural Information Processing Systems, 37:5996–6051, 2024.
- [6] Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024.
- [7] Thibault Le Sellier De Chezelles, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F Xu, Siva Reddy, Quentin Cappart, et al. The BrowserGym ecosystem for web agent research. arXiv preprint arXiv:2412.05467, 2024.
- [8] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [9] ddupont. GPT-4V-Act: Chromium copilot. https://github.com/ddupont808/GPT-4V-Act, 2023.
- [10] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023.
- [11] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. WorkArena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024.
- [12] Federal Trade Commission. How websites and apps collect and use your information. https://consumer.ftc.gov/articles/how-websites-apps-collect-use-your-information, 2025. Accessed: 2025-09-25.
- [13] Gemini Team. Gemini 3 Technical Report. Technical report, Google DeepMind, November 2025.
- [14] Google. Recorder panel: Record and measure user flow — Chrome DevTools. https://developer.chrome.com/docs/devtools/recorder/overview, 2024. Accessed: 2026-02-26.
- [15] Google. Gemini 3.1 Pro. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-02-28.
- [16] Google. Google AI Studio. https://aistudio.google.com/, 2026. Accessed: 2026-02-26.
- [17] Google. Manifest V3 — Chrome for Developers. https://developer.chrome.com/docs/extensions/develop/migrate/what-is-mv3, 2026. Accessed: 2026-02-26.
- [18] Google. Puppeteer: Node.js API for Chrome. https://pptr.dev/, 2026. Accessed: 2026-02-26.
- [19] Google DeepMind. Project Mariner: An autonomous web agent, 2025. Accessed: 2026-01-24.
- [20] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.
- [21] Faria Huq, Zora Zhiruo Wang, Frank F Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P Bigham, and Graham Neubig. CowPilot: A framework for autonomous and human-agent collaborative web navigation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), 2025.
- [22] Melody Y Ivory, Rashmi R Sinha, and Marti A Hearst. Empirically validated web page design metrics. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 53–60, 2001.
- [23] Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. Prometheus-Vision: Vision-language model as a judge for fine-grained evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, 2024.
- [24] Maurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo Tonella. Robula+: An algorithm for generating robust XPath locators for web testing. Journal of Software: Evolution and Process, 28(3):177–204, 2016.
- [25] Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, and Segev Shlomov. ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents. arXiv preprint arXiv:2410.06703, 2024.
- [26] Haoran Li, Wenbin Hu, Huihao Jing, Yulin Chen, Qi Hu, Sirui Han, Tianshu Chu, Peizhao Hu, and Yangqiu Song. PrivaCI-Bench: Evaluating privacy with contextual integrity and legal compliance. arXiv preprint arXiv:2502.17041, 2025.
- [27] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [28] Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025.
- [29] MDN contributors. Using shadow DOM - Web APIs — MDN. https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM, 2025. Accessed: 2026-02-26.
- [30] Microsoft. Playwright: Fast and reliable end-to-end testing for modern web apps. https://playwright.dev/, 2026. Accessed: 2026-02-26.
- [32] Michel Nass, Emil Alégroth, and Robert Feldt. Improving web element localization by using a large language model. Software Testing, Verification and Reliability, 34(7):e1893, 2024.
- [33] National Cyber Security Centre (NCSC). Advice & guidance — all topics. https://www.ncsc.gov.uk/section/advice-guidance/all-topics, 2025. Accessed: 2025-09-25.
- [34] National Institute of Standards and Technology. The NIST Cybersecurity Framework (CSF) 2.0. Cybersecurity White Paper CSWP 29, National Institute of Standards and Technology, 2024. Accessed: 2025-09-25.
- [35] Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S Yu, et al. A survey of WebAgents: Towards next-generation AI agents for web automation with large foundation models. arXiv preprint arXiv:2503.23350, 2025.
- [36] OpenAI. Introducing ChatGPT Atlas. Technical report, OpenAI, October 2025. Accessed: 2026-01-24.
- [37] OpenAI. Operator system card. Technical report, OpenAI, January 2025. Accessed: 2026-01-24.
- [38] Perplexity AI. Comet: The AI-powered browser, 2025. https://www.perplexity.ai/comet.
- [39] Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Wouter Joosen, et al. Tranco: A research-oriented top sites ranking hardened against manipulation. arXiv preprint arXiv:1806.01156, 2018.
- [41] Safna. Cookie Consent Trends by Country: 2026 Global Compliance Guide. https://www.cookieyes.com/blog/cookie-consent-trends/, January 2026. Accessed: 2026-02-01.
- [42] sarperavci. Google reCAPTCHA solver. https://github.com/sarperavci/GoogleRecaptchaBypass, 2024.
- [43] Selenium Project. Selenium automates browsers. That's it! https://www.selenium.dev/, 2026. Accessed: 2026-02-26.
- [44] shadcn. shadcn/ui: The foundation for your design system. https://ui.shadcn.com/, 2026. Accessed: 2026-02-25.
- [45] Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. PrivacyLens: Evaluating privacy norm awareness of language models in action. Advances in Neural Information Processing Systems, 37:89373–89407, 2024.
- [46] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [47] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
- [48] Trellix TrustedSource. Trellix TrustedSource web database reference guide. Technical report, Trellix, 2024. https://trustedsource.org/download/ts_wd_reference_guide.pdf.
- [49] Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stańczak, and Siva Reddy. SafeArena: Evaluating the safety of autonomous web agents. arXiv preprint arXiv:2503.04957, 2025.
- [50] ultrafunkamsterdam. undetected-chromedriver: Custom Selenium chromedriver — zero-config — passes all bot mitigation systems. https://github.com/ultrafunkamsterdam/undetected-chromedriver, 2026. Accessed: 2026-02-27.
- [51] Rémy van der Heijden and Cormier Pépin. Structural profiling of web sites in the wild. In International Conference on Web Engineering (ICWE), pages 225–240. Springer, 2020.
- [52] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
- [53] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
- [54] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024.
- [55] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
discussion (0)