VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang · 2024 · cs.LG · arXiv 2401.13649

Canonical reference. 90% of citing Pith papers cite this work as background.

42 Pith papers citing it

Background 90% of classified citations

abstract

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data are publicly available at https://jykoh.com/vwa.

citation-role summary

background: 9 · dataset: 1

citation-polarity summary

background: 9 · use dataset: 1

representative citing papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.

Same-Origin Policy for Agentic Browsers

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.

WebChallenger: A Reliable and Efficient Generalist Web Agent

cs.CL · 2026-06-09 · conditional · novelty 7.0

WebChallenger introduces PageMem and three architecture mechanisms to achieve competitive web navigation with open-weight LLMs on WebArena, VisualWebArena, Online-Mind2Web, and WorkArena without fine-tuning or site adapters.

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

cs.CR · 2026-06-03 · unverdicted · novelty 7.0

Frontier browser agents show strong resistance to hand-crafted multi-step prompt injections (0/140 success), unlike coding agents (up to 100%), indicating domain-conditioned safety and that prior high ASR reports may not generalize.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

cs.SE · 2026-05-22 · unverdicted · novelty 7.0

VISTA is a new benchmark for end-to-end visual spec-to-web-app generation by LLM agents, featuring five prompt conditions, manual UI annotations, multi-metric evaluation, and results on four agent systems showing partial decoupling of visual and functional performance.

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

cs.AI · 2026-05-15 · conditional · novelty 7.0

ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

cs.SE · 2026-03-04 · unverdicted · novelty 7.0

Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

cs.CR · 2025-10-11 · unverdicted · novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

cs.AI · 2024-05-23 · accept · novelty 7.0

AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering

cs.AI · 2026-06-29 · unverdicted · novelty 6.0

Rhetor automates rehearsed live web-app demos with segment-synchronized narration and real-time voice QA using cross-modal UI-plus-code features, a grounded scripter, rehearsal loops, and timing invariants, with case-study metrics on four applications.

Signal-Driven Observation for Long-Horizon Web Agents

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

LifeSkill is a verifier-guided skill learning plus online internalization framework that raises average performance by 7 points over lifelong agent baselines on LifelongAgentBench.

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

SCALE introduces three adversarial roles (Selector, Predictor, Judger) and a graph exploration method (SCALE-Hop) to enable MLLM-based web agents to self-discover limitations and improve, backed by the SCALE-20k dataset from 19 websites.

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

cs.AI · 2026-05-15 · accept · novelty 6.0

SaaS-Bench benchmark shows LLM-based agents achieve under 4% end-to-end success on 106 realistic professional tasks spanning 23 deployable SaaS platforms.

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

cs.MM · 2026-05-08 · unverdicted · novelty 6.0

MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

cs.CL · 2026-04-23 · conditional · novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

cs.AI · 2026-03-22 · conditional · novelty 6.0

AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

cs.AI · 2026-03-05 · unverdicted · novelty 6.0

WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

cs.AI · 2025-10-15 · unverdicted · novelty 6.0

Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.

citing papers explorer

Showing 17 of 17 citing papers after filters.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI · 2024-04-11 · accept · none · ref 22 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis
cs.AI · 2026-05-24 · unverdicted · none · ref 60 · internal anchor
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
cs.AI · 2026-05-15 · conditional · none · ref 9 · internal anchor
ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
cs.AI · 2024-05-23 · accept · none · ref 15 · internal anchor
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering
cs.AI · 2026-06-29 · unverdicted · none · ref 3 · internal anchor
Rhetor automates rehearsed live web-app demos with segment-synchronized narration and real-time voice QA using cross-modal UI-plus-code features, a grounded scripter, rehearsal loops, and timing invariants, with case-study metrics on four applications.

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
cs.AI · 2026-05-29 · unverdicted · none · ref 11 · internal anchor
SCALE introduces three adversarial roles (Selector, Predictor, Judger) and a graph exploration method (SCALE-Hop) to enable MLLM-based web agents to self-discover limitations and improve, backed by the SCALE-20k dataset from 19 websites.

DocOS: Towards Proactive Document-Guided Actions in GUI Agents
cs.AI · 2026-05-18 · unverdicted · none · ref 61 · internal anchor
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
cs.AI · 2026-05-15 · accept · none · ref 4 · internal anchor
SaaS-Bench benchmark shows LLM-based agents achieve under 4% end-to-end success on 106 realistic professional tasks spanning 23 deployable SaaS platforms.

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning
cs.AI · 2026-03-22 · conditional · none · ref 5 · internal anchor
AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
cs.AI · 2026-03-05 · unverdicted · none · ref 11 · internal anchor
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
cs.AI · 2025-10-15 · unverdicted · none · ref 2 · internal anchor
Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
cs.AI · 2025-06-03 · unverdicted · none · ref 30 · internal anchor
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI · 2026-05-07 · conditional · none · ref 40 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

Agentic Reasoning for Large Language Models
cs.AI · 2026-01-18 · unverdicted · none · ref 46 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.

LLM-Powered AI Agent Systems and Their Applications in Industry
cs.AI · 2025-05-22 · unverdicted · none · ref 108 · internal anchor
A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI · 2026-04-20 · unreviewed · ref 21 · internal anchor

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
cs.AI · 2025-05-26 · unreviewed · ref 15 · internal anchor
