UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang · 2025 · cs.AI · arXiv 2503.21620

Canonical reference. 78% of citing Pith papers cite this work as background.

26 Pith papers citing it

abstract

The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on AndroidControl. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.
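
To make the abstract's pairing of a rule-based action reward with GRPO more concrete, here is a minimal Python sketch. It is illustrative only and is not the authors' implementation: the abstract does not specify the reward rules, so the rule set in `action_reward` (one point for a matching action type, one more for a click landing inside a target box), the dictionary keys, and both function names are assumptions. The only part taken from the general GRPO recipe is that each sampled response is scored relative to the mean and standard deviation of rewards within its own group.

```python
from statistics import mean, pstdev


def action_reward(pred, target):
    """Hypothetical rule-based reward (assumed rules, not from the paper)."""
    reward = 0.0
    if pred.get("action") == target["action"]:
        reward += 1.0  # predicted action type matches the target
        point = pred.get("point")
        if target["action"] == "click" and point is not None:
            x, y = point
            x1, y1, x2, y2 = target["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0  # click lands inside the target box
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = {"action": "click", "bbox": (10, 10, 50, 50)}
    group = [
        {"action": "click", "point": (20, 20)},    # right type, inside box
        {"action": "click", "point": (100, 100)},  # right type, outside box
        {"action": "scroll"},                      # wrong type
    ]
    rewards = [action_reward(p, target) for p in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantages are computed relative to the group, no learned value model is needed: a response is reinforced only to the extent that it beats the other samples drawn for the same task.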

citation-role summary

background: 7 · dataset: 1 · method: 1

citation-polarity summary

background: 7 · use dataset: 1 · use method: 1

representative citing papers

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

cs.AI · 2026-02-25 · conditional · novelty 7.0

ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

cs.AI · 2025-09-08 · conditional · novelty 7.0

MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.

GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.

Video-R1: Reinforcing Video Reasoning in MLLMs

cs.CV · 2025-03-27 · conditional · novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

BAMI: Training-Free Bias Mitigation in GUI Grounding

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

cs.CL · 2025-09-09 · unverdicted · novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.

GTA1: GUI Test-time Scaling Agent

cs.AI · 2025-07-08 · unverdicted · novelty 6.0

GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.

LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

cs.LG · 2025-06-11 · unverdicted · novelty 6.0

LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on benchmarks and online tests.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

cs.AI · 2025-04-19 · unverdicted · novelty 6.0

InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

cs.LG · 2026-05-16 · unverdicted · novelty 5.0

SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

cs.AI · 2026-03-25 · unverdicted · novelty 5.0

An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

citing papers explorer

Showing 26 of 26 citing papers.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control cs.AI · 2026-05-15 · unverdicted · none · ref 21 · internal anchor
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment cs.LG · 2026-05-14 · unverdicted · none · ref 96 · 2 links · internal anchor
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 26 · internal anchor
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices cs.AI · 2026-02-25 · conditional · none · ref 27 · internal anchor
ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents cs.AI · 2025-09-08 · conditional · none · ref 24 · internal anchor
MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.
GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents cs.CV · 2025-04-14 · unverdicted · none · ref 9 · internal anchor
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.
Video-R1: Reinforcing Video Reasoning in MLLMs cs.CV · 2025-03-27 · conditional · none · ref 29 · internal anchor
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 21 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning cs.AI · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
BAMI: Training-Free Bias Mitigation in GUI Grounding cs.CV · 2026-05-07 · unverdicted · none · ref 18 · internal anchor
BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL cs.DC · 2026-05-07 · unverdicted · none · ref 41 · 2 links · internal anchor
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark cs.CV · 2026-04-27 · unverdicted · none · ref 28 · internal anchor
AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding cs.CV · 2026-04-15 · unverdicted · none · ref 14 · internal anchor
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents cs.CL · 2025-09-09 · unverdicted · none · ref 33 · internal anchor
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.
GTA1: GUI Test-time Scaling Agent cs.AI · 2025-07-08 · unverdicted · none · ref 15 · internal anchor
GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization cs.LG · 2025-06-11 · unverdicted · none · ref 16 · internal anchor
LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on benchmarks and online tests.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 41 · internal anchor
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners cs.AI · 2025-04-19 · unverdicted · none · ref 39 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision cs.CV · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.
SE-GA: Memory-Augmented Self-Evolution for GUI Agents cs.LG · 2026-05-16 · unverdicted · none · ref 25 · internal anchor
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 65 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 86 · internal anchor
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 42 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments cs.AI · 2026-03-25 · unverdicted · none · ref 212 · internal anchor
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability cs.CL · 2026-05-08 · unverdicted · none · ref 112 · internal anchor
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
Rethinking Agentic Reinforcement Learning In Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 55 · 3 links · internal anchor
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
