arXiv: 2410.05243 · v3 · submitted 2024-10-07 · 💻 cs.AI · cs.CL · cs.CV

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su

Pith reviewed 2026-05-16 21:05 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV

keywords visual grounding · GUI agents · synthetic data · multimodal LLMs · pixel coordinates · referring expressions · LLaVA · universal model

The pith

A model trained on web-based synthetic GUI data lets agents ground referring expressions to pixel coordinates using only screenshots, outperforming agents that also receive text input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that GUI agents can navigate interfaces as humans do: perceiving the screen purely visually and mapping natural-language descriptions directly to screen coordinates, without text representations such as HTML or accessibility trees. The capability comes from UGround, trained on what the authors describe as the largest GUI visual grounding dataset so far: 10 million elements and their referring expressions over 1.3 million screenshots. The training recipe combines web-based synthetic data with a slight adaptation of the LLaVA architecture. A sympathetic reader would care because text-based inputs often add noise, incompleteness, and extra computation, while a robust visual approach promises simpler, more reliable agents across platforms. Across six benchmarks, results indicate gains of up to 20 percentage points (absolute) over existing visual grounding models and stronger agent performance than systems that receive extra text signals.

Core claim

UGround is a universal visual grounding model trained on 10M GUI elements and referring expressions across 1.3M screenshots collected via web synthetic data, using a slight adaptation of the LLaVA architecture to map diverse expressions to pixel coordinates. This enables GUI agents that perceive the environment entirely visually and perform pixel-level operations, delivering up to 20 percent absolute gains over prior visual grounding models and outperforming state-of-the-art agents despite those agents using additional text-based inputs.

What carries the argument

UGround, an adapted LLaVA model trained on web synthetic data that maps referring expressions of GUI elements to their pixel coordinates on screenshots.
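
To make the mapping concrete, the sketch below scores a grounder under one common convention for click-based grounding: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The `Box` type, the pixel convention, and the function names are illustrative assumptions, not UGround's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the element.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Iterable[Tuple[float, float]],
    targets: Iterable[Box],
) -> float:
    """Fraction of predicted (x, y) points that land inside their target box."""
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    hits = sum(1 for (x, y), box in pairs if box.contains(x, y))
    return hits / len(pairs)
```

A stricter metric (for example, distance to the element center) would change the numbers, so the convention in use should be stated alongside any reported accuracy.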

If this is right

Pure visual agents surpass hybrid agents that receive extra text input on grounding, offline agent, and online agent tasks.
Synthetic web data suffices to train models that generalize across multiple GUI platforms without platform-specific text parsing.
Direct pixel-level operations reduce overhead from noisy or incomplete text representations.
Agents require only visual perception to achieve higher benchmark scores than prior state-of-the-art systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar synthetic-data recipes could extend visual grounding to dynamic or non-web interfaces such as mobile apps and desktop software.
Combining UGround with improved planning modules might support longer multi-step tasks while staying visual-only.
The approach could lower engineering effort for new platforms by removing dependence on text extraction pipelines.

Load-bearing premise

Web-based synthetic data plus minor LLaVA adaptation produces a model that generalizes to real-world diverse GUI platforms and referring expressions.

What would settle it

A new benchmark built from GUI platforms or referring expressions absent from the web synthetic training data, on which UGround shows no improvement over prior visual grounders or falls below text-based baselines.

Original abstract

Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

UGround shows a simple visual model trained on large web GUI data can beat text-augmented agents on benchmarks, but the universal generalization claim rests on untested domain overlap.

The main takeaway is that a visual grounding model trained mostly on synthetic web screenshots can outperform prior visual grounders by up to 20% absolute and even surpass current agent systems that get extra text input such as HTML or accessibility trees. The authors collected the largest GUI grounding dataset reported so far, with 10M elements and referring expressions from 1.3M screenshots, then used a light LLaVA adaptation to train UGround. On six benchmarks across grounding, offline agent, and online agent tasks, the results show consistent gains for a pixels-only approach. This gives concrete support for the idea that human-like visual perception can work for GUI navigation without structured text representations. The scale of the data collection stands out as a practical step forward that others could use. The multi-category evaluation adds weight by linking grounding accuracy directly to agent performance.

The softer spot is generalization. Training relies on web-based synthetic data, and the paper does not include explicit analysis of domain shifts such as mobile versus desktop, native versus browser elements, or overlap between training and test distributions. If the benchmarks stay close to the web synthetic style, the reported margins may not fully prove robustness on diverse real platforms. Details on data splits and baseline re-implementations are also thin at the abstract level, which leaves some uncertainty about the exact numbers.

This paper is for researchers building GUI agents or multimodal interface models who want to test vision-only setups. A reader focused on agent robustness or data scaling would find the comparisons useful. It deserves peer review because the dataset size and empirical gains are substantial enough to warrant detailed checking, even if revisions need to address the domain questions.

Referee Report

3 major / 2 minor

Summary. The paper introduces UGround, a visual grounding model for GUI agents trained via a simple recipe of web-based synthetic data (1.3M screenshots, 10M elements) and minor adaptation of the LLaVA architecture. It claims this yields a universal pixel-level grounder that outperforms prior visual grounding models by up to 20% absolute across six benchmarks in three categories (grounding, offline agent, online agent tasks) and enables purely visual agents to surpass SOTA agents that rely on additional text inputs such as HTML or accessibility trees.

Significance. If the generalization claims hold, the work is significant: it supplies the largest reported GUI grounding dataset to date and provides multi-category empirical evidence that a purely visual, human-like perception pipeline can outperform text-augmented baselines. This supports a shift away from noisy structured representations toward direct pixel grounding, with potential benefits for robustness and reduced overhead in real-world GUI agents.

major comments (3)

[Abstract and §4] Abstract and empirical evaluation section: the central claim that UGround provides 'universal' visual grounding (up to 20% gains and outperformance of text-augmented SOTA) depends on robust transfer from web synthetic training data to the test distributions. No domain-shift analysis is presented (e.g., fraction of mobile vs. desktop screenshots, native-app vs. browser-rendered elements, or overlap statistics between the 10M training elements and the six evaluation sets). This is load-bearing for the headline result.
[§4] §4 (Empirical Evaluation): insufficient detail on data splits, baseline reproduction, and potential synthetic-data biases. It is unclear whether any of the six benchmarks overlap with the synthetic collection process or how existing visual grounders were re-implemented, which directly affects interpretation of the reported margins.
[§3] §3 (Method): the 'slight adaptation' of LLaVA is described at a high level only. Concrete specification of modified components, training objectives, and hyper-parameters is needed to evaluate whether the gains stem from the data scale, the architecture change, or both.
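
One way to approach the overlap question raised in the major comments is a coarse hostname-level check between the URLs used for synthetic collection and the URLs behind a web benchmark. The sketch below is only an illustration under that assumption: the function names are hypothetical, it says nothing about native-app or mobile data (which have no URLs), and it is not a procedure taken from the paper.

```python
from urllib.parse import urlsplit


def hostnames(urls):
    """Collect lowercased hostnames, skipping strings that do not parse as URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # None when no network location is present
        if host:
            hosts.add(host)
    return hosts


def domain_overlap(train_urls, eval_urls):
    """Return the shared hostnames and the fraction of eval hostnames seen in training."""
    train_hosts = hostnames(train_urls)
    eval_hosts = hostnames(eval_urls)
    shared = train_hosts & eval_hosts
    fraction = len(shared) / len(eval_hosts) if eval_hosts else 0.0
    return shared, fraction
```

A zero overlap at the hostname level would not rule out visual similarity between training and test pages, so this check is a lower bar, not a full domain-shift analysis.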

minor comments (2)

[Abstract] Abstract: the six benchmarks are referred to only by category; naming them explicitly would aid readers in assessing coverage.
[§4] Tables in §4: performance numbers should include standard deviations or confidence intervals so that the magnitude of the 20% gains can be statistically evaluated.
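
The request for confidence intervals can be met without distributional assumptions by a percentile bootstrap over per-example outcomes. The sketch below assumes each benchmark example yields a 0/1 correctness value; the function name, the resample count, and the fixed seed are illustrative choices, not details from the paper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `outcomes`."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

To compare two models on the same examples, the same function can be applied to the list of per-example differences in correctness; an interval that excludes zero suggests the gap is unlikely to be resampling noise.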

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional analyses and details that strengthen the claims regarding transfer and reproducibility.

Referee: [Abstract and §4] Abstract and empirical evaluation section: the central claim that UGround provides 'universal' visual grounding (up to 20% gains and outperformance of text-augmented SOTA) depends on robust transfer from web synthetic training data to the test distributions. No domain-shift analysis is presented (e.g., fraction of mobile vs. desktop screenshots, native-app vs. browser-rendered elements, or overlap statistics between the 10M training elements and the six evaluation sets). This is load-bearing for the headline result.

Authors: We agree that explicit domain-shift analysis is valuable for supporting the universality claim. In the revised manuscript we have added a new subsection (Section 4.3) that reports the training data composition, including approximate fractions of mobile versus desktop screenshots and browser-rendered versus native-app elements based on source metadata. We also include pairwise similarity metrics (CLIP embedding cosine similarity and element-type distribution KL divergence) between the 1.3M training screenshots and each of the six evaluation sets to quantify distribution shift. While exhaustive element-level overlap is intractable at this scale, the reported metrics show moderate-to-high similarity on grounding benchmarks and lower similarity on agent benchmarks, consistent with the observed performance gains. revision: yes
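
The element-type KL divergence mentioned in this response can be sketched as follows. This is a minimal illustration, assuming each element carries a type label such as "button" or "link"; the add-one smoothing, the direction KL(eval || train), and the function names are assumptions, and the CLIP-based similarity is not covered here.

```python
import math
from collections import Counter


def type_distribution(element_types, vocabulary, smoothing=1.0):
    """Smoothed relative frequency of each element type over a shared vocabulary."""
    counts = Counter(element_types)
    total = len(element_types) + smoothing * len(vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def element_type_kl(train_types, eval_types, smoothing=1.0):
    """KL(eval || train) in nats over element-type labels."""
    vocabulary = sorted(set(train_types) | set(eval_types))
    p = type_distribution(eval_types, vocabulary, smoothing)   # evaluation set
    q = type_distribution(train_types, vocabulary, smoothing)  # training set
    # Smoothing keeps every q[t] positive, so the log ratio is always defined.
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocabulary)
```

KL divergence is asymmetric, so the direction should be reported with the number; larger values indicate that the evaluation set's type mix is poorly covered by the training data.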
Referee: [§4] §4 (Empirical Evaluation): insufficient detail on data splits, baseline reproduction, and potential synthetic-data biases. It is unclear whether any of the six benchmarks overlap with the synthetic collection process or how existing visual grounders were re-implemented, which directly affects interpretation of the reported margins.

Authors: We have expanded Section 4.1 and the appendix with a dedicated paragraph on data collection safeguards: the synthetic pipeline used distinct web domains and randomization seeds that were explicitly excluded from the six public benchmarks. We now list the exact train/validation splits used for UGround (90/10 on the 1.3M screenshots) and provide the precise re-implementation details for all baselines, including the GitHub commit hashes, hyper-parameters, and prompt templates employed. Potential synthetic-data biases (e.g., over-representation of common UI patterns) are discussed with a new ablation that subsamples the training set by element frequency; the performance margins remain stable, indicating that the gains are not driven by simple frequency bias. revision: yes
Referee: [§3] §3 (Method): the 'slight adaptation' of LLaVA is described at a high level only. Concrete specification of modified components, training objectives, and hyper-parameters is needed to evaluate whether the gains stem from the data scale, the architecture change, or both.

Authors: We have rewritten Section 3.2 to provide the requested concrete specification. The modifications consist of (1) replacing the original vision encoder projection with a two-layer MLP that directly outputs normalized (x,y) coordinates, (2) adding a coordinate regression head with L1 + IoU loss, and (3) freezing the language model while fine-tuning only the vision encoder and projection layers. All hyper-parameters are now listed in Table 2 (learning rate 2e-5, batch size 128, 3 epochs, AdamW with cosine decay). An ablation study (new Table 3) isolates the contribution of data scale versus these architectural changes, showing that both factors are necessary to reach the reported performance. revision: yes
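
For readers unfamiliar with the "L1 + IoU" objective named in this simulated response, the sketch below shows one plausible form of it. Several assumptions are involved: IoU is only defined for regions, so the sketch works on boxes `(left, top, right, bottom)` in normalized coordinates even though the response mentions `(x, y)` outputs; the IoU term is taken as `1 - IoU`; and the equal weighting is arbitrary. It is not the paper's training code.

```python
def box_area(box):
    """Area of a (left, top, right, bottom) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


def l1_plus_iou_loss(pred, target, iou_weight=1.0):
    """Mean absolute coordinate error plus a weighted (1 - IoU) penalty."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4
    return l1 + iou_weight * (1.0 - iou(pred, target))
```

The L1 term still gives a useful signal when the boxes do not overlap at all, which is where the IoU term alone is flat.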

Circularity Check

0 steps flagged

No significant circularity in the empirical pipeline

The paper presents an empirical approach: collection of a new 1.3M-screenshot / 10M-element GUI dataset from web sources, slight adaptation of the external LLaVA architecture, and direct evaluation on six independent benchmarks spanning grounding, offline, and online agent tasks. No mathematical derivation, parameter-fitting step that is later relabeled as a prediction, or load-bearing self-citation chain exists. All performance claims rest on external benchmark numbers rather than any quantity that reduces to the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the approach relies on standard supervised training of an adapted vision-language model using synthetic data.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMSkills: Towards Multimodal Skills for General Visual Agents
cs.AI 2026-05 unverdicted novelty 7.0

MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
cs.CV 2025-04 unverdicted novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
MMSkills: Towards Multimodal Skills for General Visual Agents
cs.AI 2026-05 unverdicted novelty 6.0

MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
cs.CV 2026-04 unverdicted novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
cs.AI 2025-12 unverdicted novelty 6.0

EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
cs.AI 2025-03 accept novelty 6.0

UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Perceptual Flow Network for Visually Grounded Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
cs.AI 2026-04 unverdicted novelty 5.0

LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
cs.CV 2026-04 unverdicted novelty 5.0

Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
cs.CV 2026-03 unverdicted novelty 5.0

Empirical study finds background semantics, random pruning, and recency-based allocation improve token efficiency for GUI visual agents.
ZAYA1-VL-8B Technical Report
cs.CV 2026-05 unverdicted novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
ClawMobile: Rethinking Smartphone-Native Agentic Systems
cs.MA 2026-02 unverdicted novelty 4.0

ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 18 Pith papers

[1] GUIAct: We use the annotated data from GUIAct (web-single). Steps that do not involve coordinates or that are marked as multi-step operations (for example, “click ... then type”) are filtered out. We use both the Instruction and Action annotations for grounding (i.e., each element is seen in training twice with different expressions)
[2] AndroidControl: Similarly, we use the human-annotated actions from the training set. We filter out any actions that do not have associated coordinate data, ensuring that only steps with specific visual grounding targets are included in the dataset
[3] Widget Caption: For each element in the training set, multiple functional captions are provided. To enhance diversity, two captions per element are randomly selected from the available set of functional captions during data construction
[4] UIBert: We use the training set elements from UIBert without any additional special processing, directly utilizing the referring expressions provided by this dataset
[5] AITZ: We incorporate the annotated actions (Thought) from AITZ, using each step’s action annotation for grounding in the dataset. These annotations contribute to a more diverse set of referring expressions, particularly for action-oriented grounding tasks. F MODEL AND TRAINING DETAILS. F.1 OVERVIEW. For flexible investigation of the model architecture, we... · 2023
[6] LLaVA-1.5 Pretraining and Finetuning: We follow the exact pretraining in Liu et al. (2024a). Then, in the instruction finetuning stage, we change the grounding data from normalized coordinates to absolute coordinates as we wish, and start to use our modified AnyRes setting
[7] GUI Visual Grounding: Then we train UGround on our training datasets. Due to the huge computation cost of handling high-resolution images, we use LoRA (Hu et al., 2022) for instruction finetuning in the two stages, with a device batch size of 4. The first stage takes about 50 hours on a single 4x NVIDIA A100 machine (global batch size 128 with gradient ac... · 2022
[8] Make sure you understand the task goal to avoid wrong actions
[9] Ensure you carefully examine the current screenshot and issue a valid action based on the observation
[10] You should only issue one action at a time
[11] The element you want to operate with must be fully visible in the screenshot. If it is only partially visible, you need to SCROLL DOWN to see the entire element
[12] “the target element”: The necessary element to achieve the task goal may be located further down the page. If you don’t want to interact with any elements, simply select SCROLL DOWN to move to the section below. Reasoning Explain the action you want to perform and the element you want to operate with (if applicable). Describe your thought process and reason in 3 sentences. Out... · 2025
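
The filtering described in the GUIAct and AndroidControl entries above (drop steps without coordinates, drop steps marked as multi-step, and reuse both the Instruction and Action annotations as separate referring expressions) can be sketched as below. The record layout and the field names `coordinates`, `multi_step`, `instruction`, and `action` are hypothetical; the real datasets use their own schemas.

```python
def build_grounding_examples(steps):
    """Turn annotated agent steps into (expression, point) grounding examples."""
    examples = []
    for step in steps:
        point = step.get("coordinates")
        # Skip steps with no grounding target or flagged as multi-step operations.
        if point is None or step.get("multi_step", False):
            continue
        # Each annotation type becomes its own example, so an element can appear
        # twice with different expressions.
        for key in ("instruction", "action"):
            expression = step.get(key)
            if expression:
                examples.append({"expression": expression, "point": tuple(point)})
    return examples
```

For a step with both annotations this yields two examples that share one target point, matching the note that each element is seen in training twice with different expressions.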