arxiv: 2506.20332 · v4 · submitted 2025-06-25 · 💻 cs.AI

Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

Jihao Gu , Qihang Ai , Yingyao Wang , Pi Bu , Jingxuan Xing , Zekun Zhu , Wei Jiang , Ziming Wang

show 5 more authors

Yingxiu Zhao Ming-Liang Zhang Jun Song Yuning Jiang Bo Zheng

This is my paper

Pith reviewed 2026-05-19 08:22 UTC · model grok-4.3

classification 💻 cs.AI

keywords mobile agentsvision-language modelsreinforcement learninghierarchical trainingGUI interactionself-correctionexplorationChinese dataset

0 comments p. Extension

The pith

A three-stage hierarchical curriculum trains vision-language models to explore mobile screens and correct their own errors more effectively than prior offline methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that mobile agents built on vision-language models often stall in local optima when trained only on fixed data or immediate action rewards. It proposes Mobile-R1, a staged recipe that first aligns reasoning format, then uses on-policy exploration with verifiable feedback, and finally applies multi-turn task rewards inside a real environment. This progression is claimed to unlock sustained exploration and self-correction during complex GUI tasks. The authors support the work with a new Chinese mobile dataset of 24,521 annotations across 28 apps and a 500-trajectory benchmark. If the claim holds, agents could move from brittle step-by-step execution toward more autonomous handling of multi-step mobile interactions.

Core claim

Mobile-R1 bridges atomic action execution and strategic task completion through a hierarchical curriculum of three stages: format alignment for reasoning structure, on-policy exploration with verifiable action feedback to ground basic execution, and multi-turn task-level training inside a realistic environment. This sequence bootstraps the agent and produces improved exploration together with self-correction capabilities that prior offline or local-reward approaches lacked.

What carries the argument

The three-stage hierarchical curriculum that progressively moves from reasoning alignment to grounded on-policy exploration to full task-level training with environment feedback.

If this is right

Agents gain the ability to recover from mistakes during long GUI sequences rather than repeating failed actions.
The method reduces reliance on large amounts of pre-collected offline trajectories for effective training.
A publicly released Chinese GUI dataset and benchmark enable evaluation of agents in non-English mobile ecosystems.
The approach can be applied to other VLM-based agents that interact with structured screen environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar staged curricula could be tested on web or desktop interfaces where feedback is also sparse.
Combining the method with larger base models might further amplify the observed self-correction gains.
The open-sourced resources could serve as a shared testbed for comparing interactive training recipes across research groups.

Load-bearing premise

On-policy exploration with verifiable action feedback plus multi-turn task training in a realistic environment will produce better exploration and error correction than offline or local-reward training without introducing new convergence or overfitting problems.

What would settle it

If agents trained with the full three-stage curriculum show no reduction in local-optima failures compared with agents trained only on the first two stages when evaluated on the 500-trajectory benchmark, the added value of the final multi-turn stage would be falsified.

Figures

Figures reproduced from arXiv: 2506.20332 by Bo Zheng, Jihao Gu, Jingxuan Xing, Jun Song, Ming-Liang Zhang, Pi Bu, Qihang Ai, Wei Jiang, Yingxiu Zhao, Yingyao Wang, Yuning Jiang, Zekun Zhu, Ziming Wang.

**Figure 1.** Figure 1: Comparison of Action-Level and Task-Level Rewards. Figure (a) illustrates an action-induced agent that is encouraged [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Pipeline of data collection. with R1-V exploring Reinforcement Learning with Verifiable Reward (RLVR) to boost VLMs’ visual reasoning. It demonstrated that RLVR methods exhibit strong out-ofdistribution generalization, whereas SFT excels in training domain tasks (Chen et al. 2025b). Trajectory Dataset To support the community’s efforts in training and evaluating powerful agents, we first created a high-… view at source ↗

**Figure 3.** Figure 3: Our training framework consists of three stages: initial format finetuning, online training via action-level reward, [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of trajectory length. tion Trajectory Annotation, 2) Single-step GRPO training with action-level rewards to enhance the format compliance and click accuracy, and 3) Online training with task-level rewards based on multi-turn trajectories to improve the generalization ability. Preliminary Task Definition Mobile agents are driven by textual instructions, understand screenshots of mobile screen… view at source ↗

**Figure 5.** Figure 5: Reward score during Stage3 training [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of reasoning trajectories between Mobile-R1 and Qwen2.5-VL-3B-Instruct on the task “ [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Pass K performance of Mobile-R1. To further assess the generalization capabilities of Mobile-R1, we validated our method’s impact on end-toend task performance by evaluating the success rate across 50 random tasks on physical devices. As detailed in Fiure 8, the model exhibited a consistent improvement trajectory. A minor initial dip in performance was observed during Stage 1, attributable to the action-l… view at source ↗

**Figure 9.** Figure 9: Prompt for Instruction Generation. Prompt for Agent Exploration You are a mobile GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. You are provided with function signatures within <tools></tools> XML tags: <tools> { "name": "mobile_use", "arguments": { "type": "function", "function": { "name_for_human": "mobile_use", "name":… view at source ↗

**Figure 10.** Figure 10: A.4 Prompts of Training System This prompt was designed to ensure consistency with the Qwen2.5-VL’s GUI prompt. The bond indicates the instruction, which can be replaced by any instruction of our dataset. Prompts of Training System You are a mobile GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. You are provided with fun… view at source ↗

**Figure 11.** Figure 11: Comparison of the thinking processes of the Stage2 model and Stage3 model of Mobile-R1 under the same task [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Robustness Analysis on Unseen Applications. [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗

**Figure 13.** Figure 13: Robustness Analysis on Unseen Applications. [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗

**Figure 14.** Figure 14: Qualitative result of Mobile-R1 under the task “ [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗

**Figure 15.** Figure 15: Qualitative result of Mobile-R1 under the task “ [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗

read the original abstract

Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches centers on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present \textbf{Mobile-R1}, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training with realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the ``Eureka'' moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weight, and codes: https://mobile-r1.github.io/Mobile-R1/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The three-stage curriculum plus new Chinese GUI dataset is a practical recipe worth testing, but the paper still needs to show numbers and ablations before the gains look solid.

read the letter

The main thing here is that Mobile-R1 lays out a three-stage curriculum to move VLM agents from basic format following to verifiable action feedback and then full multi-turn task training in a realistic environment, plus a new Chinese mobile dataset with 24,521 annotations across 28 apps and a 500-trajectory benchmark. They plan to open-source the lot, which is straightforwardly useful for anyone working on non-English GUI agents or sparse-reward RL for vision-language models.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Mobile-R1, a hierarchical training recipe for VLM-based mobile agents consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training with realistic environment feedback. It claims this curriculum addresses limitations of offline training and local action-level rewards (local optima, sparse GUI interactions causing convergence issues) and unlocks improved exploration and self-correction ('Eureka' moments). The work also contributes a Chinese mobile GUI dataset covering 28 applications with 24,521 high-quality manual annotations and establishes a benchmark with 500 trajectories, with all resources to be open-sourced.

Significance. If the empirical results and ablations confirm that the three-stage curriculum produces measurably superior exploration and error-correction behavior without introducing instability or dataset-specific overfitting, the work would provide a practical, reproducible training recipe for interactive mobile agents and help address data scarcity in non-English GUI settings. The open-sourcing of the dataset, benchmark, model weights, and code would further strengthen its utility for the community.

major comments (3)

[Abstract and method description] The central claim that the hierarchical strategy 'effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction' is load-bearing yet unsupported by any quantitative results, ablation studies isolating stage 3, training dynamics, variance metrics, or error analysis on the 500-trajectory benchmark. The abstract and method description provide no evidence that multi-turn task-level training reliably avoids the convergence difficulties or overfitting risks noted for direct task-level rewards.
[Introduction and proposed approach] The assertion that 'directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions' is presented without supporting data such as learning curves, success rates across stages, or comparisons to prior offline/local-reward baselines. This leaves the motivation for the specific three-stage ordering unverified.
[Experiments and evaluation] No metrics are reported on self-correction frequency, exploration efficiency, or robustness to the 24,521 Chinese annotations that would demonstrate stage 3's contribution to 'Eureka' moments or rule out new convergence failures.

minor comments (2)

[Method] A diagram or pseudocode outlining the transitions, reward signals, and data flow between the three stages would improve clarity of the curriculum.
[Implementation details] The manuscript should explicitly state the base VLM, exact hyperparameters for each stage, and how on-policy exploration is implemented to allow reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of our hierarchical training recipe and the Chinese GUI dataset. We agree that the central claims regarding improved exploration and self-correction require stronger quantitative support, and we will revise the manuscript accordingly to address the identified gaps in evidence.

read point-by-point responses

Referee: [Abstract and method description] The central claim that the hierarchical strategy 'effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction' is load-bearing yet unsupported by any quantitative results, ablation studies isolating stage 3, training dynamics, variance metrics, or error analysis on the 500-trajectory benchmark. The abstract and method description provide no evidence that multi-turn task-level training reliably avoids the convergence difficulties or overfitting risks noted for direct task-level rewards.

Authors: We acknowledge that the abstract and method sections would benefit from more explicit quantitative backing for the load-bearing claims. While the full paper reports overall benchmark improvements from the full curriculum, we agree that isolating stage 3's contribution and providing supporting analyses are necessary. In the revision we will add: (i) ablation studies comparing the three-stage curriculum against a two-stage variant (omitting multi-turn task-level training), (ii) training dynamics and variance metrics across multiple runs on the 500-trajectory benchmark, and (iii) an error analysis quantifying self-correction events. We will also include a brief discussion of observed convergence behavior under direct task-level rewards versus the staged approach. revision: yes
Referee: [Introduction and proposed approach] The assertion that 'directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions' is presented without supporting data such as learning curves, success rates across stages, or comparisons to prior offline/local-reward baselines. This leaves the motivation for the specific three-stage ordering unverified.

Authors: We agree that empirical motivation for the three-stage ordering should be strengthened with data. In the revised manuscript we will insert learning curves and per-stage success rates that compare direct task-level reward training against our staged curriculum, together with comparisons to the offline and local-reward baselines used in prior work. These additions will directly illustrate the convergence issues that motivated the hierarchical design. revision: yes
Referee: [Experiments and evaluation] No metrics are reported on self-correction frequency, exploration efficiency, or robustness to the 24,521 Chinese annotations that would demonstrate stage 3's contribution to 'Eureka' moments or rule out new convergence failures.

Authors: We thank the referee for highlighting this omission. The current evaluation emphasizes aggregate task success; we will augment the Experiments section with quantitative metrics including self-correction frequency (fraction of trajectories exhibiting successful recovery after an error), exploration efficiency (unique GUI states or actions visited per episode), and an analysis of performance robustness across the 24,521 Chinese annotations. These metrics will be reported both for the full curriculum and for ablated variants to isolate stage 3's role. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical training recipe with external data

full rationale

The paper describes a three-stage hierarchical curriculum (format alignment, on-policy verifiable-action exploration, multi-turn task-level training) as an empirical training strategy that relies on external environment feedback, a newly contributed dataset of 24,521 Chinese GUI annotations across 28 apps, and a 500-trajectory benchmark. No equations, fitted parameters, or performance metrics are shown to reduce by construction to the target results; the central claim about bootstrapping exploration and self-correction is presented as an outcome of the training process evaluated on independent resources rather than a self-referential derivation. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical training recipe that relies on standard reinforcement learning assumptions about environment feedback and reward sparsity; no new mathematical axioms or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5821 in / 1193 out tokens · 34430 ms · 2026-05-19T08:22:14.243259+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three-stage training process: format finetuning, action-level GRPO training, and task-level GRPO training
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

multi-turn task-level training with realistic environment to unlock exploration and self-correction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
cs.AI 2026-02 unverdicted novelty 7.0

The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
cs.AI 2025-08 unverdicted novelty 5.0

InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench benchmark.
A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL 2025-09 accept novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 3 Pith papers · 12 internal anchors

[1]

Qwen2.5-VL Technical Report

Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Chen, L.; Li, L.; Zhao, H.; Song, Y .; and Vinci. 2025a. R1-V: Reinforcing Super Generalization Ability in Vision- Language Models with Less than $3. Accessed: 2025-02-

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Chen, L.; Li, L.; Zhao, H.; Song, Y .; Vinci; Kong, L.; Liu, Q.; and Chang, B. 2025b. RLVR in Vision Language Mod- els: Findings, Questions and Directions. Notion Post. Chen, P.; Bu, P.; Wang, Y .; Wang, X.; Wang, Z.; Guo, J.; Zhao, Y .; Zhu, Q.; Song, J.; Yang, S.; Wang, J.; and Zheng, B. 2025c. CombatVLA: An Efficient Vision-Language- Action Model for C...

work page arXiv
[3]

arXiv:2502.11718

”See the World, Discover Knowledge”: A Chinese Factuality Evaluation for Large Vision Language Models. arXiv:2502.11718. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al

work page arXiv
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Huang, X.; Liu, W.; Chen, X.; Wang, X.; Wang, H.; Lian, D.; Wang, Y .; Tang, R.; and Chen, E

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Understanding the planning of LLM agents: A survey

Understand- ing the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716. Li, J.; and Huang, K

work page internal anchor Pith review Pith/arXiv arXiv
[6]

A summary on gui agents with foundation models enhanced by reinforcement learning,

A Summary on GUI Agents with Foundation Models Enhanced by Reinforcement Learning. arXiv preprint arXiv:2504.20464. Li, W.; Bishop, W.; Li, A.; Rawles, C.; Campbell-Ajala, F.; Tyamagundlu, D.; and Riva, O. 2024a. On the effects of data scale on computer control agents. arXiv e-prints , arXiv–

work page arXiv
[7]

E.; Li, A.; Rawles, C.; Campbell-Ajala, F.; Tyamagundlu, D.; and Riva, O

Li, W.; Bishop, W. E.; Li, A.; Rawles, C.; Campbell-Ajala, F.; Tyamagundlu, D.; and Riva, O. 2024b. On the effects of data scale on ui control agents. Advances in Neural Infor- mation Processing Systems, 37: 92130–92154. Li, Y .; Zhang, C.; Yang, W.; Fu, B.; Cheng, P.; Chen, X.; Chen, L.; and Wei, Y . 2024c. Appagent v2: Ad- vanced agent for flexible mobi...

work page arXiv
[8]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Showui: One vision- language-action model for gui visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, 19498–19508. Liu, Y .; Li, P.; Xie, C.; Hu, X.; Han, X.; Zhang, S.; Yang, H.; and Wu, F. 2025a. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reason- ers. arXiv:2504.14239. Liu, Z.; Su...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2406.08451

Gui odyssey: A comprehensive dataset for cross-app gui navigation on mo- bile devices. arXiv preprint arXiv:2406.08451. Lu, Z.; Chai, Y .; Guo, Y .; Yin, X.; Liu, L.; Wang, H.; Xiao, H.; Ren, S.; Xiong, G.; and Li, H. 2025b. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620. Luo, R.; Wang, L.; He, ...

work page arXiv
[10]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv:2504.10458. Nguyen, D.; Chen, J.; Wang, Y .; Wu, G.; Park, N.; Hu, Z.; Lyu, H.; Wu, J.; Aponte, R.; Xia, Y .; et al

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Gui agents: A survey,

Gui agents: A survey. arXiv preprint arXiv:2412.13501. OpenAI, R

work page arXiv
[12]

GPT-4 Technical Report

Gpt-4 technical report. arxiv 2303.08774. View in Article, 2(5). Peng, Y .; Wang, X.; Wei, Y .; Pei, J.; Qiu, W.; Jian, A.; Hao, Y .; Pan, J.; Xie, T.; Ge, L.; et al

work page internal anchor Pith review Pith/arXiv arXiv
[13]

arXiv preprint arXiv:2504.05599

Skywork r1v: Pio- neering multimodal reasoning with chain-of-thought. arXiv preprint arXiv:2504.05599. Putta, P.; Mills, E.; Garg, N.; Motwani, S.; Finn, C.; Garg, D.; and Rafailov, R

work page arXiv
[14]

Agent q: Advanced reasoning and learning for autonomous ai agents

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv:2408.07199. Qin, Y .; Ye, Y .; Fang, J.; Wang, H.; Liang, S.; Tian, S.; Zhang, J.; Li, J.; Li, Y .; Huang, S.; et al

work page arXiv
[15]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326. Shao, Y .; Li, L.; Dai, J.; and Qiu, X

work page internal anchor Pith review Pith/arXiv arXiv
[16]

arXiv preprint arXiv:2310.10158

Character- llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y .; Wu, Y .; et al

work page arXiv
[17]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open lan- guage models. arXiv preprint arXiv:2402.03300. Sun, Q.; Cheng, K.; Ding, Z.; Jin, C.; Wang, Y .; Xu, F.; Wu, Z.; Jia, C.; Chen, L.; Liu, Z.; et al

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Os-genesis: Automating gui agent trajectory construction via reverse task synthesis

OS-Genesis: Au- tomating GUI Agent Trajectory Construction via Reverse Task Synthesis. arXiv preprint arXiv:2412.19723. Wang, J.; Xu, H.; Jia, H.; Zhang, X.; Yan, M.; Shen, W.; Zhang, J.; Huang, F.; and Sang, J. 2024a. Mobile- agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.010...

work page arXiv
[19]

Mobile-agent-e: Self-evolving mobile assistant for complex tasks

Mobile-Agent-E: Self- Evolving Mobile Assistant for Complex Tasks. arXiv preprint arXiv:2501.11733. Wanyan, Y .; Zhang, X.; Xu, H.; and et al

work page arXiv
[20]

arXiv preprint arXiv:2506.04614

Look Be- fore You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation. arXiv preprint arXiv:2506.04614. Wu, Z.; Wu, Z.; Xu, F.; Wang, Y .; Sun, Q.; Jia, C.; Cheng, K.; Ding, Z.; Chen, L.; Liang, P. P.; et al

work page arXiv
[21]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218. Xu, Y .; Wang, Z.; Wang, J.; Lu, D.; Xie, T.; Saha, A.; Sa- hoo, D.; Yu, T.; and Xiong, C

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454. Yang, Y .; He, X.; Pan, H.; Jiang, X.; Deng, Y .; Yang, X.; Lu, H.; Yin, D.; Rao, F.; Zhu, M.; et al

work page internal anchor Pith review Pith/arXiv arXiv
[23]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

R1-onevision: Advancing generalized multimodal reasoning through cross- modal formalization. arXiv preprint arXiv:2503.10615. Yuan, S.; Song, K.; Chen, J.; Tan, X.; Shen, Y .; Kan, R.; Li, D.; and Yang, D

work page internal anchor Pith review Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2401.06201

Easytool: Enhancing llm- based agents with concise tool instruction. arXiv preprint arXiv:2401.06201. Zhang, C.; Yang, Z.; Liu, J.; Li, Y .; Han, Y .; Chen, X.; Huang, Z.; Fu, B.; and Yu, G. 2025a. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–20. Zhang, Z.; Lu, Y .; Fu,...

work page arXiv 2025
[25]

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

R1-Zero’s” Aha Moment” in Visual Reasoning on a 2B Non-SFT Model. arXiv preprint arXiv:2503.05132. Limitation From the perspective of training strategy, this paper em- ploys both action-level and task-level rewards to guide the RL training process, allowing the agent to gradually improve its capabilities. However, exploring how to enhance perfor- mance us...

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Followed

Specifically, we tracked the model’s tail success ratio as a function of the training steps. As illustrated in Figuree 13, the tail success ratio exhibits a consistent upward trend with the increase in steps. This finding suggests that a more prolonged phase of end-to-end exploration is beneficial for achieving final task success. Figure 13: Robustness An...

work page 2035