ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Alon Oved; Avi Yaeli; Ben Wiesel; Ido Levy; Nir Mashkif; Sami Marreed; Segev Shlomov

arxiv: 2410.06703 · v7 · pith:JI3Z6YVCnew · submitted 2024-10-09 · 💻 cs.AI

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy , Ben Wiesel , Sami Marreed , Alon Oved , Avi Yaeli , Nir Mashkif , Segev Shlomov This is my paper

classification 💻 cs.AI

keywords agentsst-webagentbenchevaluatingsafetytextitacrossagentcompletion

0 comments

read the original abstract

Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce \textbf{\textsc{ST-WebAgentBench}}, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the \textit{Completion Under Policy} (\textit{CuP}) metric, which credits only completions that respect all applicable policies, and the \textit{Risk Ratio}, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, \href{https://sites.google.com/view/st-webagentbench/home}{\textsc{ST-WebAgentBench}} provides an actionable first step toward deploying trustworthy web agents at scale.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
cs.CY 2026-04 accept novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
cs.CR 2026-01 unverdicted novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
cs.CL 2026-05 unverdicted novelty 7.0

DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
cs.CR 2026-04 unverdicted novelty 7.0

WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
cs.CR 2025-10 unverdicted novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
cs.AI 2025-06 unverdicted novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
cs.AI 2026-05 unverdicted novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
cs.SE 2026-04 unverdicted novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
PageGuide: Browser extension to assist users in navigating a webpage and locating information
cs.HC 2026-04 accept novelty 6.0

PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.
Governance by Construction for Generalist Agents
cs.AI 2026-05 unverdicted novelty 5.0

CUGA introduces a runtime governance architecture that enforces policies at five checkpoints in generalist agent execution pipelines for predictable and compliant behavior.
An Executable Benchmarking Suite for Tool-Using Agents
cs.SE 2026-05 unverdicted novelty 5.0

The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
cs.AI 2025-10 unverdicted novelty 4.0

A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.