WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Difei Gao; Henry Hengyuan Zhao; Kaiming Yang; Mike Zheng Shou; Wendi Yu

arxiv: 2502.08047 · v5 · pith:GFCIV7O6new · submitted 2025-02-12 · 💻 cs.AI · cs.MA

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

keywords: agents, benchmark, initial, planning, worldgui, applications, desktop, framework

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at https://github.com/showlab/WorldGUI.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
cs.AI 2026-05 unverdicted novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
cs.LG 2026-04 conditional novelty 7.0

GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
cs.AI 2026-02 conditional novelty 7.0

ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and...

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...