pith. sign in

arxiv: 2502.08047 · v5 · pith:GFCIV7O6new · submitted 2025-02-12 · 💻 cs.AI · cs.MA

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

classification 💻 cs.AI cs.MA
keywords agentsbenchmarkinitialplanningworldguiapplicationsdesktopframework
0
0 comments X
read the original abstract

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at https://github.com/showlab/WorldGUI.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

    cs.AI 2026-05 unverdicted novelty 7.0

    PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

  2. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  3. ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

    cs.AI 2026-02 conditional novelty 7.0

    ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and...

  4. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...