Mobile-Agent-v3: Fundamental Agents for GUI Automation

Fei Huang; Feiyu Gao; Haiyang Xu; Haowei Liu; Jiabo Ye; Jingren Zhou; Jitong Liao; Junjie Cao; Junyang Wang; Ming Yan

arxiv: 2508.15144 · v2 · pith:T2ETCQLGnew · submitted 2025-08-21 · 💻 cs.AI

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan

Pith reviewed 2026-05-20 11:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords GUI agents · mobile automation · reinforcement learning · self-evolving trajectories · AndroidWorld · OSWorld · UI grounding

The pith

GUI-Owl and Mobile-Agent-v3 set new open-source records for GUI agents on Android and desktop benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops GUI-Owl as a base model for GUI agents that handles grounding, planning, and decision making across mobile and desktop systems. It relies on a cloud infrastructure to automatically generate and refine large sets of interaction trajectories in a self-improving loop that needs little manual labeling. Mobile-Agent-v3 then layers on scalable reinforcement learning with a new relative policy optimization method to better match real device use. These steps produce clear gains on standard benchmarks. Readers would care because reliable GUI agents could automate routine phone and computer tasks across many apps without per-app scripting.

Core claim

GUI-Owl achieves state-of-the-art results among open-source end-to-end models on ten GUI benchmarks by incorporating large-scale environment infrastructure for self-evolving trajectory production, diverse foundational agent capabilities for end-to-end decision-making, and scalable environment reinforcement learning with Trajectory-aware Relative Policy Optimization. This leads to Mobile-Agent-v3 improving the scores to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks.

What carries the argument

Self-Evolving GUI Trajectory Production framework that generates high-quality interaction data via automated query generation, correctness validation, and iterative refinement in a self-improving loop.
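The loop described above (generate queries, roll them out, validate, feed the kept trajectories back) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Trajectory`, `run_loop`, `toy_make_rollout`, `toy_validate`, and the `REQUIRED_STEPS` table are all hypothetical names, and the "model improves as data grows" behavior is a toy stand-in for retraining.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    query: str
    steps: List[str]


Rollout = Callable[[str], Trajectory]


def run_loop(
    queries: List[str],
    make_rollout: Callable[[List[Trajectory]], Rollout],
    validate: Callable[[Trajectory], bool],
    rounds: int,
) -> List[Trajectory]:
    """Repeat: refresh the policy on data so far, roll out unsolved
    queries, and keep only trajectories that pass validation."""
    dataset: List[Trajectory] = []
    for _ in range(rounds):
        rollout = make_rollout(dataset)  # stand-in for retraining
        solved = {t.query for t in dataset}
        for query in queries:
            if query in solved:
                continue
            traj = rollout(query)
            if validate(traj):  # correctness check gates the data
                dataset.append(traj)
    return dataset


# Toy stand-ins (hypothetical): each query needs a minimum number of steps,
# and the "policy" produces longer trajectories as the dataset grows.
REQUIRED_STEPS: Dict[str, int] = {
    "open settings": 1,
    "toggle wifi": 2,
    "rename file": 3,
}


def toy_make_rollout(dataset: List[Trajectory]) -> Rollout:
    skill = len(dataset) + 1
    return lambda query: Trajectory(query, ["tap"] * skill)


def toy_validate(traj: Trajectory) -> bool:
    return len(traj.steps) >= REQUIRED_STEPS[traj.query]
```

In this toy setup each round unlocks one more query, which is the shape of the self-improving behavior the page describes: validated data from earlier rounds makes later rollouts succeed on harder queries.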

If this is right

The model supports end-to-end decision making and serves as a modular component in multi-agent systems.
Scalable asynchronous RL training with TRPO improves online performance on complex tasks such as OSWorld.
The infrastructure supports diverse data pipelines across Android, Ubuntu, macOS, and Windows while cutting manual annotation.
Performance gains show stronger integration of UI grounding, planning, action semantics, and reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-evolving data loop could transfer to training agents for web interfaces or other software environments.
Accurate virtual environments might enable direct deployment of these agents on user-owned devices with minimal retraining.
Combining the framework with existing multi-agent setups could produce more general automation for everyday computing.

Load-bearing premise

Cloud-based virtual environments accurately reproduce the timing, rendering, and error modes of real user devices so that collected trajectories transfer without large distribution shift.

What would settle it

Running the trained Mobile-Agent-v3 agents on physical Android devices and real desktop machines and checking whether success rates match the reported virtual benchmark numbers.
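One way to make "checking whether success rates match" concrete is a two-proportion z-test on task success counts from the virtual and physical runs. The sketch below assumes independent tasks and large enough samples for the normal approximation; the function name `two_proportion_z` and any counts passed to it are illustrative, not figures from the paper.

```python
import math
from typing import Tuple


def two_proportion_z(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference of two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All successes or all failures in both groups: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)
```

A small p-value would indicate that the physical-device success rate differs from the virtual one by more than sampling noise, which is the distribution-shift signal the premise above is worried about.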

Original abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
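As a quick worked check on the headline numbers, the gains from GUI-Owl-7B to Mobile-Agent-v3 can be computed directly from the scores quoted in the abstract. The helper name `gains` is illustrative; the four scores are the only inputs and come from the abstract.

```python
from typing import Dict, Tuple

# (GUI-Owl-7B score, Mobile-Agent-v3 score), as quoted in the abstract.
SCORES: Dict[str, Tuple[float, float]] = {
    "AndroidWorld": (66.4, 73.3),
    "OSWorld": (29.4, 37.7),
}


def gains(base: float, new: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = new - base
    relative = 100 * absolute / base
    return round(absolute, 1), round(relative, 1)


for name, (base, new) in SCORES.items():
    absolute, relative = gains(base, new)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The framework adds 6.9 points on AndroidWorld and 8.3 points on OSWorld, so the relative improvement is considerably larger on the harder desktop benchmark.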

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cloud-scale self-evolving trajectories give Mobile-Agent-v3 a boost on GUI benchmarks, but virtual-to-real transfer remains an open question.

The paper's main advance is a cloud setup spanning multiple OSes that generates and refines GUI interaction trajectories at scale, leading to better performance from GUI-Owl-7B and then Mobile-Agent-v3 on AndroidWorld and OSWorld. The authors report 73.3 on AndroidWorld and 37.7 on OSWorld, which beats their own baseline and sets a new mark for open-source frameworks. The self-evolving loop uses automated query generation and correctness checks, then feeds back into the model for better data. They add Trajectory-aware Relative Policy Optimization for the RL stage and run it asynchronously. Releasing the full code and model is a real help.

This approach stands out for handling diverse environments in one infrastructure and reducing reliance on hand-labeled data. The model combines grounding, planning, and reasoning in an end-to-end way that can plug into larger systems. Those are useful engineering steps beyond earlier single-platform agents.

The weaker part is the lack of supporting detail. No error bars or ablations appear in the abstract, which makes it hard to gauge how reliable the gains are or what they should be attributed to. The validation step in the trajectory loop could use more description. The virtual cloud environments are central, but if they diverge from real devices in timing or failure modes, the results may not carry over directly. That transfer assumption is worth testing explicitly.

Readers working on practical GUI agents for automation or testing will find the benchmarks and released artifacts valuable. The release gives them a concrete open-source option to compare against.

The work shows clear empirical progress and honest engagement with scaling data for agents, so it deserves a serious referee even if revisions will be needed for full methods. I recommend sending it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces GUI-Owl, a 7B foundational end-to-end GUI agent model that achieves SOTA results among open-source models on ten GUI benchmarks across desktop and mobile settings, reporting 66.4 on AndroidWorld and 29.4 on OSWorld. It further presents Mobile-Agent-v3, a general-purpose framework that improves these scores to 73.3 and 37.7 respectively and claims new SOTA for open-source GUI agent frameworks. Core contributions include a cloud-based virtual environment infrastructure enabling a Self-Evolving GUI Trajectory Production framework (with automated query generation and correctness validation), integration of UI grounding/planning/reasoning capabilities, and a scalable asynchronous RL setup using Trajectory-aware Relative Policy Optimization (TRPO) that achieves 34.9 on OSWorld. Models and code are released at https://github.com/X-PLUG/MobileAgent.

Significance. If the benchmark gains are robust, the work would meaningfully advance open-source GUI agents by demonstrating a scalable, low-annotation pipeline for generating interaction trajectories and aligning them via RL. The concrete numbers on AndroidWorld and OSWorld, combined with the open release of code and models, provide a useful baseline and resource for the community. The self-evolving loop and TRPO formulation represent practical engineering contributions that could generalize to other agent settings.

major comments (2)

[Section 3] Section 3 (Large-scale Environment Infrastructure and Self-Evolving GUI Trajectory Production): The central empirical claims rest on trajectories and policies produced inside the authors' cloud-based virtual Android/Ubuntu/macOS/Windows instances. The manuscript provides no experiments or metrics quantifying fidelity to the AndroidWorld and OSWorld benchmark environments with respect to screen rendering, input latency, or failure modes. This is load-bearing for the reported improvements (e.g., GUI-Owl-7B at 66.4/29.4 and Mobile-Agent-v3 at 73.3/37.7), as any systematic mismatch would imply distribution shift and prevent apples-to-apples comparison with prior baselines.
[Scalable Environment RL] Scalable Environment RL section: The description of the fully asynchronous training framework and the introduced TRPO variant lacks (a) the explicit reward function or correctness-validation criteria used inside the self-evolving loop, (b) ablation tables isolating the contribution of TRPO versus standard methods, and (c) error bars or statistical tests on the 34.9 OSWorld score. These omissions make it difficult to assess whether the RL component is responsible for the observed gains or whether the results are sensitive to implementation details.

minor comments (2)

The abstract states results on 'ten GUI benchmarks' but only details AndroidWorld and OSWorld; a single summary table aggregating all ten would improve readability and allow direct comparison with prior work.
Notation for TRPO (Trajectory-aware Relative Policy Optimization) is introduced without a formal equation or pseudocode; adding a concise algorithmic box would clarify the modification relative to the standard Trust Region Policy Optimization algorithm, which shares the TRPO acronym.
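This page gives no formula for Trajectory-aware Relative Policy Optimization, so the following is only one plausible reading of the name, not the authors' algorithm: score each rollout with a single trajectory-level reward, normalize those rewards within a group of rollouts for the same task, and assign the resulting advantage to every step of the trajectory. All function names and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def trajectory_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (reward - group mean) / (group std + eps).

    ASSUMPTION: this normalization is a guess based on the method's name.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_step_advantages(
    rewards: List[float], step_counts: List[int]
) -> List[List[float]]:
    """Broadcast each trajectory's advantage to all of its steps."""
    advantages = trajectory_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, step_counts)]
```

Under this reading, a group in which every rollout succeeds or every rollout fails yields zero advantages and therefore no learning signal, which is one reason an explicit statement of the reward and grouping scheme matters.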

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and strengthen the empirical presentation without misrepresenting the current manuscript.

Referee: [Section 3] Section 3 (Large-scale Environment Infrastructure and Self-Evolving GUI Trajectory Production): The central empirical claims rest on trajectories and policies produced inside the authors' cloud-based virtual Android/Ubuntu/macOS/Windows instances. The manuscript provides no experiments or metrics quantifying fidelity to the AndroidWorld and OSWorld benchmark environments with respect to screen rendering, input latency, or failure modes. This is load-bearing for the reported improvements (e.g., GUI-Owl-7B at 66.4/29.4 and Mobile-Agent-v3 at 73.3/37.7), as any systematic mismatch would imply distribution shift and prevent apples-to-apples comparison with prior baselines.

Authors: We appreciate the referee's emphasis on environment fidelity, which is indeed important for validating the training pipeline. Our cloud-based virtual environments are configured using standard Android emulators and desktop virtualization stacks chosen to match the OS versions, screen resolutions, and action spaces specified in AndroidWorld and OSWorld. The self-evolving trajectories are generated and validated against the same UI element hierarchies and interaction semantics used in the benchmarks. However, the manuscript does not currently include quantitative side-by-side metrics on rendering pixel fidelity, input latency distributions, or failure-mode statistics. To address this concern directly, we will add a new subsection in Section 3 that details the environment configuration parameters, provides qualitative alignment arguments, and discusses why minor discrepancies are unlikely to explain the consistent gains observed across ten benchmarks. We believe this addition will allow readers to better evaluate potential distribution shift.
Revision: yes
Referee: [Scalable Environment RL] Scalable Environment RL section: The description of the fully asynchronous training framework and the introduced TRPO variant lacks (a) the explicit reward function or correctness-validation criteria used inside the self-evolving loop, (b) ablation tables isolating the contribution of TRPO versus standard methods, and (c) error bars or statistical tests on the 34.9 OSWorld score. These omissions make it difficult to assess whether the RL component is responsible for the observed gains or whether the results are sensitive to implementation details.

Authors: We agree that these elements are necessary for full reproducibility and for isolating the contribution of Trajectory-aware Relative Policy Optimization (TRPO). In the revised manuscript we will (a) explicitly state the reward function and the automated correctness-validation criteria applied within the self-evolving loop, (b) add ablation tables comparing TRPO against standard PPO and other online RL baselines under identical data and environment conditions, and (c) report error bars together with statistical significance tests (e.g., standard deviation and p-values across multiple random seeds) for the 34.9 OSWorld result. These revisions will clarify the role of the asynchronous RL framework and the specific TRPO formulation in the reported performance.
Revision: yes
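The error-bar reporting promised in (c) amounts to summarizing a benchmark score over several random seeds. A minimal sketch, assuming one score per seed; the function name `summarize` is hypothetical and any values passed to it are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)


def format_score(scores: List[float]) -> str:
    """Render per-seed scores as 'mean ± std (n=...)'."""
    m, s = summarize(scores)
    return f"{m:.1f} ± {s:.1f} (n={len(scores)})"
```

Reporting the seed count alongside the spread lets readers judge whether a gap between two methods is larger than run-to-run variation.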

Circularity Check

0 steps flagged

No circularity in empirical benchmark claims or self-evolving data pipeline

The paper reports measured performance on external public benchmarks (AndroidWorld, OSWorld) after training and evaluation in cloud virtual environments. The Self-Evolving GUI Trajectory Production framework is described as an iterative empirical data-generation and RL procedure that uses the model to refine trajectories, but the final reported scores are not derived by construction from fitted parameters or self-referential definitions; they remain independently falsifiable on held-out benchmarks. No equations, uniqueness theorems, or self-citation chains are invoked to force the results. This is a standard empirical agent paper whose central claims rest on external evaluation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that virtual environments match real GUI dynamics and that automated correctness validation produces high-quality trajectories without introducing systematic bias.

axioms (1)

domain assumption: Virtual cloud environments reproduce real-device timing, rendering, and failure modes sufficiently for policy transfer.
Invoked when claiming that trajectories generated inside the infrastructure are useful for real-world alignment.

pith-pipeline@v0.9.0 · 5871 in / 1231 out tokens · 32087 ms · 2026-05-20T11:55:29.452344+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
cs.CR 2026-01 unverdicted novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
cs.AI 2025-12 accept novelty 8.0

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 7.0

An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Faithful Mobile GUI Agents with Guided Advantage Estimator
cs.AI 2026-05 unverdicted novelty 7.0

Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
cs.CL 2026-04 unverdicted novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
cs.AI 2026-02 unverdicted novelty 7.0

The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
cs.AI 2026-05 unverdicted novelty 6.0

AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
cs.AI 2026-05 unverdicted novelty 6.0

Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
cs.CR 2026-04 unverdicted novelty 6.0

Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
cs.AI 2025-12 conditional novelty 6.0

AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
cs.AI 2025-10 unverdicted novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
cs.CL 2025-09 unverdicted novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
SE-GA: Memory-Augmented Self-Evolution for GUI Agents
cs.LG 2026-05 unverdicted novelty 5.0

SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
cs.AI 2026-04 unverdicted novelty 5.0

HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
cs.LG 2026-02 unverdicted novelty 5.0

UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.