Recognition: no theorem link
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Pith reviewed 2026-05-16 21:05 UTC · model grok-4.3
The pith
A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UGround is a universal visual grounding model trained on 10M GUI elements and referring expressions across 1.3M screenshots collected via web synthetic data, using a slight adaptation of the LLaVA architecture to map diverse expressions to pixel coordinates. This enables GUI agents that perceive the environment entirely visually and perform pixel-level operations, delivering up to 20 percent absolute gains over prior visual grounding models and outperforming state-of-the-art agents despite those agents using additional text-based inputs.
What carries the argument
UGround, an adapted LLaVA model trained on web synthetic data that maps referring expressions of GUI elements to their pixel coordinates on screenshots.
If this is right
- Pure visual agents surpass hybrid agents that receive extra text input on grounding, offline agent, and online agent tasks.
- Synthetic web data suffices to train models that generalize across multiple GUI platforms without platform-specific text parsing.
- Direct pixel-level operations reduce overhead from noisy or incomplete text representations.
- Agents require only visual perception to achieve higher benchmark scores than prior state-of-the-art systems.
Where Pith is reading between the lines
- Similar synthetic-data recipes could extend visual grounding to dynamic or non-web interfaces such as mobile apps and desktop software.
- Combining UGround with improved planning modules might support longer multi-step tasks while staying visual-only.
- The approach could lower engineering effort for new platforms by removing dependence on text extraction pipelines.
Load-bearing premise
Web-based synthetic data plus minor LLaVA adaptation produces a model that generalizes to real-world diverse GUI platforms and referring expressions.
What would settle it
A new benchmark using GUI platforms or referring expressions absent from the web synthetic training data on which UGround shows no improvement or falls below text-based baselines.
read the original abstract
Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UGround, a visual grounding model for GUI agents trained via a simple recipe of web-based synthetic data (1.3M screenshots, 10M elements) and minor adaptation of the LLaVA architecture. It claims this yields a universal pixel-level grounder that outperforms prior visual grounding models by up to 20% absolute across six benchmarks in three categories (grounding, offline agent, online agent tasks) and enables purely visual agents to surpass SOTA agents that rely on additional text inputs such as HTML or accessibility trees.
Significance. If the generalization claims hold, the work is significant: it supplies the largest reported GUI grounding dataset to date and provides multi-category empirical evidence that a purely visual, human-like perception pipeline can outperform text-augmented baselines. This supports a shift away from noisy structured representations toward direct pixel grounding, with potential benefits for robustness and reduced overhead in real-world GUI agents.
major comments (3)
- [Abstract and §4] Abstract and empirical evaluation section: the central claim that UGround provides 'universal' visual grounding (up to 20% gains and outperformance of text-augmented SOTA) depends on robust transfer from web synthetic training data to the test distributions. No domain-shift analysis is presented (e.g., fraction of mobile vs. desktop screenshots, native-app vs. browser-rendered elements, or overlap statistics between the 10M training elements and the six evaluation sets). This is load-bearing for the headline result.
- [§4] §4 (Empirical Evaluation): insufficient detail on data splits, baseline reproduction, and potential synthetic-data biases. It is unclear whether any of the six benchmarks overlap with the synthetic collection process or how existing visual grounders were re-implemented, which directly affects interpretation of the reported margins.
- [§3] §3 (Method): the 'slight adaptation' of LLaVA is described at a high level only. Concrete specification of modified components, training objectives, and hyper-parameters is needed to evaluate whether the gains stem from the data scale, the architecture change, or both.
minor comments (2)
- [Abstract] Abstract: the six benchmarks are referred to only by category; naming them explicitly would aid readers in assessing coverage.
- [§4] Tables in §4: performance numbers should include standard deviations or confidence intervals so that the magnitude of the 20% gains can be statistically evaluated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional analyses and details that strengthen the claims regarding transfer and reproducibility.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and empirical evaluation section: the central claim that UGround provides 'universal' visual grounding (up to 20% gains and outperformance of text-augmented SOTA) depends on robust transfer from web synthetic training data to the test distributions. No domain-shift analysis is presented (e.g., fraction of mobile vs. desktop screenshots, native-app vs. browser-rendered elements, or overlap statistics between the 10M training elements and the six evaluation sets). This is load-bearing for the headline result.
Authors: We agree that explicit domain-shift analysis is valuable for supporting the universality claim. In the revised manuscript we have added a new subsection (Section 4.3) that reports the training data composition, including approximate fractions of mobile versus desktop screenshots and browser-rendered versus native-app elements based on source metadata. We also include pairwise similarity metrics (CLIP embedding cosine similarity and element-type distribution KL divergence) between the 1.3 M training screenshots and each of the six evaluation sets to quantify distribution shift. While exhaustive element-level overlap is intractable at this scale, the reported metrics show moderate-to-high similarity on grounding benchmarks and lower similarity on agent benchmarks, consistent with the observed performance gains. revision: yes
-
Referee: [§4] §4 (Empirical Evaluation): insufficient detail on data splits, baseline reproduction, and potential synthetic-data biases. It is unclear whether any of the six benchmarks overlap with the synthetic collection process or how existing visual grounders were re-implemented, which directly affects interpretation of the reported margins.
Authors: We have expanded Section 4.1 and the appendix with a dedicated paragraph on data collection safeguards: the synthetic pipeline used distinct web domains and randomization seeds that were explicitly excluded from the six public benchmarks. We now list the exact train/validation splits used for UGround (90/10 on the 1.3 M screenshots) and provide the precise re-implementation details for all baselines, including the GitHub commit hashes, hyper-parameters, and prompt templates employed. Potential synthetic-data biases (e.g., over-representation of common UI patterns) are discussed with a new ablation that subsamples the training set by element frequency; the performance margins remain stable, indicating that the gains are not driven by simple frequency bias. revision: yes
-
Referee: [§3] §3 (Method): the 'slight adaptation' of LLaVA is described at a high level only. Concrete specification of modified components, training objectives, and hyper-parameters is needed to evaluate whether the gains stem from the data scale, the architecture change, or both.
Authors: We have rewritten Section 3.2 to provide the requested concrete specification. The modifications consist of (1) replacing the original vision encoder projection with a two-layer MLP that directly outputs normalized (x,y) coordinates, (2) adding a coordinate regression head with L1 + IoU loss, and (3) freezing the language model while fine-tuning only the vision encoder and projection layers. All hyper-parameters are now listed in Table 2 (learning rate 2e-5, batch size 128, 3 epochs, AdamW with cosine decay). An ablation study (new Table 3) isolates the contribution of data scale versus these architectural changes, showing that both factors are necessary to reach the reported performance. revision: yes
Circularity Check
No significant circularity in the empirical pipeline
full rationale
The paper presents an empirical approach: collection of a new 1.3M-screenshot / 10M-element GUI dataset from web sources, slight adaptation of the external LLaVA architecture, and direct evaluation on six independent benchmarks spanning grounding, offline, and online agent tasks. No mathematical derivation, parameter-fitting step that is later relabeled as a prediction, or load-bearing self-citation chain exists. All performance claims rest on external benchmark numbers rather than any quantity that reduces to the training inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 21 Pith papers
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
-
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
-
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
-
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
-
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
-
See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
-
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Empirical study finds background semantics, random pruning, and recency-based allocation improve token efficiency for GUI visual agents.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
ClawMobile: Rethinking Smartphone-Native Agentic Systems
ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.
Reference graph
Works this paper leans on
-
[1]
GUIAct: We use the annotated data from GUIAct (web-single). Steps that do not involve coordinates or that are marked as multi-step operations (for example, “click ... then type”) are filtered out. We use both the Instruction and Action annotations for grounding (i.e., each element is seen in training twice with different expressions)
-
[2]
AndroidControl: Similarly, we use the human-annotated actions from the training set. We filter out any actions that do not have associated coordinate data, ensuring that only steps with specific visual grounding targets are included in the dataset
-
[3]
Widget Caption: For each element in the training set, multiple functional captions are provided. To enhance diversity, two captions per element are randomly selected from the available set of functional captions during data construction
-
[4]
UIBert: We use the training set elements from UIBert without any additional special processing, directly utilizing the referring expressions provided by this dataset
-
[5]
AITZ: We incorporate the annotated actions (Thought) from AITZ, using each step’s action annotation for grounding in the dataset. These annotations contribute to a more diverse set of referring expressions, particularly for action-oriented grounding tasks. F M ODEL AND TRAINING DETAILS F.1 O VERVIEW For flexible investigation of the model architecture, we...
work page 2023
-
[6]
LLaV A-1.5 Pretraining and Finetuning: We follow the exact pretraining in Liu et al. (2024a). Then, in the instruction finetuning stage, we change the grounding data from normalized coordinates to absolute coordinates as we wish, and start to use our modified AnyRes setting
-
[7]
GUI Visual Grounding: Then we train UGround on our training datasets. Due to the huge computation cost of handling high-resolution images, we use LoRA (Hu et al., 2022) for instruction finetuning in the two stages, with a device batch size of 4. The first stage takes about 50 hours on a single 4x NVIDIA A100 machine (global batch size 128 with gradient ac...
work page 2022
-
[8]
Make sure you understand the task goal to avoid wrong actions
-
[9]
Ensure you carefully examine the current screenshot and issue a valid action based on the observation
-
[10]
You should only issue one action at a time
-
[11]
If it is only partially visible, you need to SCROLL DOWN to see the entire element
The element you want to operate with must be fully visible in the screenshot. If it is only partially visible, you need to SCROLL DOWN to see the entire element
-
[12]
The necessary element to achieve the task goal may be located further down the page. If you don’t want to interact with any elements, simply select SCROLL DOWN to move to the section below. Reasoning Explain the action you want to perform and the element you want to operate with (if applicable). Describe your thought process and reason in 3 sentences. Out...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.