UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Bingqi Chen; Hui Li; Jia Ma; Rongrong Ji; Shuquan Lian; Xiawu Zheng; Yifan Ding; Yuhang Wu; Zihan Song

arXiv: 2507.22025 · v4 · submitted 2025-07-29 · cs.AI · cs.CL · cs.CV

Pith reviewed 2026-05-19 02:09 UTC · model grok-4.3

keywords: GUI agents · multimodal LLMs · grounding accuracy · continuous reward · decomposed grounding · ScreenSpot benchmarks · supervised fine-tuning · inference-time selection

The pith

UI-AGILE advances GUI agents by pairing continuous rewards and cropping in training with decomposed grounding at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets persistent problems in GUI agents built on multimodal models: ineffective reward signals, unbalanced reasoning, and visual noise on detailed screens. It makes three changes to supervised fine-tuning: a continuous reward that scores how precisely an element is localized, a simple thinking reward that keeps planning short while preserving accuracy, and a cropping-based resampling step that generates more useful training examples for hard cases. At inference, the method splits high-resolution images into smaller regions and selects the best local prediction. Together, these steps yield state-of-the-art grounding scores on ScreenSpot-Pro and ScreenSpot-v2, along with stronger overall agent behavior.

Core claim

UI-AGILE shows that a continuous reward function, a simple thinking reward, and cropping-based resampling during training, combined with decomposed grounding with selection at inference, raise grounding accuracy and general agent performance on ScreenSpot-Pro and ScreenSpot-v2.
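
To make the continuous reward concrete, here is a minimal Python sketch built on the one formula fragment quoted on this page, R(x, y) = 1 + exp(−4 · d²_norm) for predictions inside the target box. Everything else is an assumption made for illustration: the zero reward outside the box, the definition of d_norm (per-axis offset from the box centre divided by the half-extent, combined with a max), and the function name itself.

```python
import math


def continuous_grounding_reward(x, y, bbox):
    """Score a predicted click (x, y) against a target box (x1, y1, x2, y2).

    Inside the box the reward is 1 + exp(-4 * d_norm**2), as quoted on this
    page. The outside-box value and the d_norm definition are assumptions.
    """
    x1, y1, x2, y2 = bbox
    if not (x1 <= x <= x2 and y1 <= y <= y2):
        return 0.0  # assumption: no reward when the click misses the box

    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) / 2, (y2 - y1) / 2

    # Assumption: normalise each axis by the half-extent so that the centre
    # maps to 0 and the box edge maps to 1, then take the larger offset.
    dx = abs(x - cx) / half_w if half_w > 0 else 0.0
    dy = abs(y - cy) / half_h if half_h > 0 else 0.0
    d_norm = max(dx, dy)

    return 1.0 + math.exp(-4.0 * d_norm ** 2)
```

Under these assumptions a click at the exact centre scores 2.0, a click on the box edge scores just above 1, and a miss scores 0, so the signal keeps rewarding precision after the prediction has already landed inside the box.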

What carries the argument

Decomposed grounding with selection, which divides a high-resolution screen image into smaller manageable parts and picks the most reliable local prediction.
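
The split-then-select idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: the fixed grid with a pixel overlap, the highest-score selection rule, and the two callables `ground_in_tile` and `score_candidate` (hypothetical stand-ins for the model calls that predict a point inside a tile and rate a candidate) are all choices made here.

```python
from typing import Callable, Optional

Box = tuple[int, int, int, int]  # (left, top, right, bottom), full-image pixels
Point = tuple[float, float]


def split_into_tiles(width: int, height: int, rows: int = 2, cols: int = 2,
                     overlap_px: int = 128) -> list[Box]:
    """Cut the screen into a rows x cols grid of overlapping tiles."""
    tile_w = -(-width // cols)   # ceiling division
    tile_h = -(-height // rows)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            left = max(0, c * tile_w - overlap_px)
            top = max(0, r * tile_h - overlap_px)
            right = min(width, (c + 1) * tile_w + overlap_px)
            bottom = min(height, (r + 1) * tile_h + overlap_px)
            tiles.append((left, top, right, bottom))
    return tiles


def decomposed_grounding(
    width: int,
    height: int,
    ground_in_tile: Callable[[Box], Optional[Point]],  # hypothetical model call
    score_candidate: Callable[[Point], float],         # hypothetical scorer
) -> Optional[Point]:
    """Ground in every tile, map results back, and keep the best-scored one."""
    best_point, best_score = None, float("-inf")
    for tile in split_into_tiles(width, height):
        local = ground_in_tile(tile)  # tile-local (x, y), or None if no match
        if local is None:
            continue
        # Shift the tile-local prediction back into full-image coordinates.
        point = (tile[0] + local[0], tile[1] + local[1])
        score = score_candidate(point)
        if score > best_score:
            best_point, best_score = point, score
    return best_point
```

The overlap is there so that an element straddling a tile boundary still appears whole in at least one tile; how the selection score is actually produced is left open in this sketch.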

If this is right

GUI agents locate interface elements more reliably on complex high-resolution displays.
Training produces agents that plan quickly without sacrificing grounding precision.
Resampling lets agents learn effectively from tasks that previously gave almost no reward signal.
Overall agent success rates rise on real software interaction benchmarks beyond pure grounding.
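
The resampling consequence in the list above can be illustrated with a small Python sketch. It assumes, without confirmation from this page, that resampling is triggered when every rollout for a sample earns zero reward, and that the easier variant is a fixed-scale crop centred on the target element. The function names and the `scale` parameter are hypothetical.

```python
def crop_around_target(image_size, bbox, scale=0.5):
    """Return (window, shifted_bbox) for a crop that still contains the target.

    image_size is (width, height); bbox is (x1, y1, x2, y2) in image pixels.
    Assumption: the crop is `scale` times the image in each dimension.
    """
    width, height = image_size
    x1, y1, x2, y2 = bbox
    # The crop is never smaller than the target itself.
    crop_w = max(int(width * scale), x2 - x1)
    crop_h = max(int(height * scale), y2 - y1)

    # Centre the window on the target, then clamp it inside the image.
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    left = min(max(0, cx - crop_w // 2), width - crop_w)
    top = min(max(0, cy - crop_h // 2), height - crop_h)

    window = (left, top, left + crop_w, top + crop_h)
    shifted_bbox = (x1 - left, y1 - top, x2 - left, y2 - top)
    return window, shifted_bbox


def maybe_crop_sample(image_size, bbox, rollout_rewards, scale=0.5):
    """If every rollout scored zero, return an easier cropped variant.

    Returns None when at least one rollout already earned a reward, since
    the sample then carries a usable learning signal as it is.
    """
    if any(r > 0 for r in rollout_rewards):
        return None
    return crop_around_target(image_size, bbox, scale)
```

The target box is re-expressed in crop coordinates so that the same reward function can score predictions on the cropped screenshot.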

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same image-partition idea at inference could help other vision-language agents that must act on detailed scenes.
The reward and resampling changes might let smaller base models reach performance levels that currently require larger ones.
Combining this pipeline with larger-scale reinforcement learning loops is a natural next step for further gains.

Load-bearing premise

The continuous reward, simple thinking reward, and cropping resampling actually reduce sparse rewards and visual noise without creating new overfitting or benchmark-specific biases.

What would settle it

Running the full UI-AGILE pipeline on a fresh GUI benchmark never seen during development and finding zero or negative accuracy gain would falsify the central claim.

Original abstract

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE for enhancing GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves the state-of-the-art grounding performance on two benchmarks ScreenSpot-Pro and ScreenSpot-v2 while it also exhibits strong general agent capabilities. For instance, using both our training and inference enhancement methods brings 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. We provide the code in https://github.com/KDEGroup/UI-AGILE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces UI-AGILE to improve GUI agents via training enhancements to SFT (a continuous reward function for high-precision grounding, a 'Simple Thinking' reward to balance planning/speed/accuracy, and cropping-based resampling to address sparse rewards on complex tasks) and an inference-time method of decomposed grounding with selection to handle high-resolution displays. It reports state-of-the-art grounding accuracy on ScreenSpot-Pro and ScreenSpot-v2, with a specific claim that combining the training and inference methods yields a 23% improvement over the best baseline on ScreenSpot-Pro. The work also asserts strong general agent capabilities, and code is released.

Significance. If the reported gains prove robust, the work would offer practical, implementable improvements to GUI agent training and inference that directly target sparse rewards and visual noise in complex UIs. The open release of code is a clear strength that supports reproducibility. The empirical focus on public benchmarks like ScreenSpot-Pro makes the contribution testable, though the magnitude of advance depends on whether the gains generalize beyond the evaluated distributions.

major comments (2)

[§4 and abstract] §4 (Experiments) and abstract: The headline 23% grounding accuracy improvement on ScreenSpot-Pro is presented as arising from the combination of the three training modifications plus decomposed inference, yet the manuscript supplies no ablation tables isolating each component's contribution, no statistical significance tests across runs, and no explicit description of baseline model selection or exact reward formulations. This leaves the central empirical claim under-supported.
[§3.1] §3.1 (Training improvements): The continuous reward, simple thinking reward, and cropping resampling are asserted to mitigate sparse rewards and visual noise without introducing overfitting or benchmark-specific biases, but no sensitivity analysis, hyperparameter details, or cross-benchmark validation is provided to substantiate that these modifications produce additive, generalizable gains rather than exploiting ScreenSpot-Pro's screen-size or annotation characteristics.

minor comments (2)

The abstract and method descriptions would benefit from a single consolidated equation or pseudocode block defining the combined reward function to improve clarity.
Figure captions for any grounding visualizations should explicitly state the resolution and UI complexity of the examples shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying our empirical support and outlining revisions to strengthen the presentation of results.

Referee: [§4 and abstract] §4 (Experiments) and abstract: The headline 23% grounding accuracy improvement on ScreenSpot-Pro is presented as arising from the combination of the three training modifications plus decomposed inference, yet the manuscript supplies no ablation tables isolating each component's contribution, no statistical significance tests across runs, and no explicit description of baseline model selection or exact reward formulations. This leaves the central empirical claim under-supported.

Authors: We agree that additional ablations would better isolate contributions to the reported 23% gain. In the revised version, we will add an ablation table in Section 4 showing incremental performance from each training component (continuous reward, simple thinking reward, cropping resampling) and the inference-time method. We will also report results over multiple random seeds with standard deviations to address statistical significance. Baseline selection followed the strongest publicly reported models on the same benchmarks (as cited in Section 4); we will make this criterion explicit. Reward formulations are defined mathematically in Section 3.1; we will add cross-references in the experiments section for clarity.

revision: yes

Referee: [§3.1] §3.1 (Training improvements): The continuous reward, simple thinking reward, and cropping resampling are asserted to mitigate sparse rewards and visual noise without introducing overfitting or benchmark-specific biases, but no sensitivity analysis, hyperparameter details, or cross-benchmark validation is provided to substantiate that these modifications produce additive, generalizable gains rather than exploiting ScreenSpot-Pro's screen-size or annotation characteristics.

Authors: We acknowledge that further analysis would strengthen claims of generalizability. The revised manuscript will include a sensitivity analysis appendix for key hyperparameters (e.g., continuous reward scaling factor and cropping thresholds). Expanded hyperparameter details will be provided. Cross-benchmark validation is already shown via consistent gains on both ScreenSpot-Pro and ScreenSpot-v2, which differ in resolution and complexity; we will add a discussion explaining the design choices (dynamic cropping based on task sparsity rather than fixed benchmark traits) to mitigate concerns of overfitting or bias.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on GUI agents

The paper describes proposed training modifications (continuous reward function, simple thinking reward, cropping resampling) and an inference method (decomposed grounding with selection), then reports experimental grounding accuracy gains on external benchmarks ScreenSpot-Pro and ScreenSpot-v2. The 23% improvement is presented as an observed outcome from these changes, not a mathematical prediction or first-principles result that reduces to the same fitted parameters or inputs by construction. No equations, self-definitional steps, or load-bearing self-citations appear in the provided abstract and methods framing. The work relies on public code and standard benchmark comparisons, making the central claims self-contained against external evaluation rather than internally circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view supplies no explicit free parameters, axioms, or invented entities; the approach rests on standard SFT/RL machinery plus custom reward designs whose hyperparameters are not detailed here.

pith-pipeline@v0.9.0 · 5792 in / 1080 out tokens · 59069 ms · 2026-05-19T02:09:06.301411+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: continuous grounding reward ... R(x, y) = 1 + exp(−4 · d²_norm) if (x, y) ∈ BBox

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Rlength(L) uses cosine for smooth degradation between lmin/lideal/lmax
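
The cosine-shaped length term quoted in the passage above can be sketched in Python as follows. The page gives only the three anchor names (lmin, lideal, lmax) and the use of a cosine, so the exact shape here is an assumption: the reward is taken to rise from 0 at `l_min` to 1 at `l_ideal` and fall back to 0 at `l_max` along half-cosine curves. The function name, the peak value of 1, and the example lengths are hypothetical.

```python
import math


def length_reward(length, l_min, l_ideal, l_max):
    """Cosine-smoothed reward for the length of the reasoning text.

    Assumed shape: 0 at or below l_min, 1 at l_ideal, 0 at or above l_max,
    with half-cosine ramps in between so the reward changes smoothly.
    """
    if not (l_min < l_ideal < l_max):
        raise ValueError("expected l_min < l_ideal < l_max")
    if length <= l_min or length >= l_max:
        return 0.0

    if length <= l_ideal:
        # Progress from l_min (t = 0) up to l_ideal (t = 1).
        t = (length - l_min) / (l_ideal - l_min)
    else:
        # Progress from l_max (t = 0) back to l_ideal (t = 1).
        t = (l_max - length) / (l_max - l_ideal)

    # Half-cosine ramp: 0 when t = 0, 1 when t = 1, flat slope at both ends.
    return 0.5 * (1.0 - math.cos(math.pi * t))
```

For example, with the hypothetical anchors `l_min=10`, `l_ideal=50`, `l_max=200`, a 50-token rationale gets the full reward, while very short and very long rationales both decay smoothly toward zero.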

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
cs.AI 2026-05 unverdicted novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...

Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

UI-Copilot adds a selective copilot for memory and math to GUI agents and trains tool use with separate single-turn and multi-turn optimization, yielding SOTA results on MemGUI-Bench and a 17.1% gain on AndroidWorld.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context
cs.AI 2026-04 unverdicted novelty 5.0

SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmar...

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 3.0

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.