OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
hub Mixed citations
GPT-4V(ision) is a Generalist Web Agent, if Grounded
Mixed citation behavior. Most common role is background (67%).
abstract
The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1 of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turns out to be not effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.
Proposes COM-as-Action paradigm for deterministic software manipulation, introduces ComCADBench benchmark and ComActor agent that achieves SOTA performance over GUI baselines.
Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.
OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
SCALE introduces three adversarial roles (Selector, Predictor, Judger) and a graph exploration method (SCALE-Hop) to enable MLLM-based web agents to self-discover limitations and improve, backed by the SCALE-20k dataset from 19 websites.
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.
DRIVE disentangles reasoning and interaction skills for web agents via dual-level modeling and scene-aware coordination, reaching 52.8% success on WebArena tasks.
PageGuide is a browser extension that grounds LLM responses in webpage DOM elements via visual overlays for Find, Guide, and Hide modes, reporting performance gains over unaided browsing in a 94-user study.
citing papers explorer
-
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.