pith. the verified trust layer for science. sign in

arxiv: 2506.20332 · v4 · submitted 2025-06-25 · 💻 cs.AI

Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

Pith reviewed 2026-05-19 08:22 UTC · model grok-4.3

classification 💻 cs.AI
keywords mobile agentsvision-language modelsreinforcement learninghierarchical trainingGUI interactionself-correctionexplorationChinese dataset
0
0 comments X p. Extension

The pith

A three-stage hierarchical curriculum trains vision-language models to explore mobile screens and correct their own errors more effectively than prior offline methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that mobile agents built on vision-language models often stall in local optima when trained only on fixed data or immediate action rewards. It proposes Mobile-R1, a staged recipe that first aligns reasoning format, then uses on-policy exploration with verifiable feedback, and finally applies multi-turn task rewards inside a real environment. This progression is claimed to unlock sustained exploration and self-correction during complex GUI tasks. The authors support the work with a new Chinese mobile dataset of 24,521 annotations across 28 apps and a 500-trajectory benchmark. If the claim holds, agents could move from brittle step-by-step execution toward more autonomous handling of multi-step mobile interactions.

Core claim

Mobile-R1 bridges atomic action execution and strategic task completion through a hierarchical curriculum of three stages: format alignment for reasoning structure, on-policy exploration with verifiable action feedback to ground basic execution, and multi-turn task-level training inside a realistic environment. This sequence bootstraps the agent and produces improved exploration together with self-correction capabilities that prior offline or local-reward approaches lacked.

What carries the argument

The three-stage hierarchical curriculum that progressively moves from reasoning alignment to grounded on-policy exploration to full task-level training with environment feedback.

If this is right

  • Agents gain the ability to recover from mistakes during long GUI sequences rather than repeating failed actions.
  • The method reduces reliance on large amounts of pre-collected offline trajectories for effective training.
  • A publicly released Chinese GUI dataset and benchmark enable evaluation of agents in non-English mobile ecosystems.
  • The approach can be applied to other VLM-based agents that interact with structured screen environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar staged curricula could be tested on web or desktop interfaces where feedback is also sparse.
  • Combining the method with larger base models might further amplify the observed self-correction gains.
  • The open-sourced resources could serve as a shared testbed for comparing interactive training recipes across research groups.

Load-bearing premise

On-policy exploration with verifiable action feedback plus multi-turn task training in a realistic environment will produce better exploration and error correction than offline or local-reward training without introducing new convergence or overfitting problems.

What would settle it

If agents trained with the full three-stage curriculum show no reduction in local-optima failures compared with agents trained only on the first two stages when evaluated on the 500-trajectory benchmark, the added value of the final multi-turn stage would be falsified.

Figures

Figures reproduced from arXiv: 2506.20332 by Bo Zheng, Jihao Gu, Jingxuan Xing, Jun Song, Ming-Liang Zhang, Pi Bu, Qihang Ai, Wei Jiang, Yingxiu Zhao, Yingyao Wang, Yuning Jiang, Zekun Zhu, Ziming Wang.

Figure 1
Figure 1. Figure 1: Comparison of Action-Level and Task-Level Rewards. Figure (a) illustrates an action-induced agent that is encouraged [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline of data collection. with R1-V exploring Reinforcement Learning with Veri￾fiable Reward (RLVR) to boost VLMs’ visual reasoning. It demonstrated that RLVR methods exhibit strong out-of￾distribution generalization, whereas SFT excels in training domain tasks (Chen et al. 2025b). Trajectory Dataset To support the community’s efforts in training and evaluat￾ing powerful agents, we first created a high-… view at source ↗
Figure 3
Figure 3. Figure 3: Our training framework consists of three stages: initial format finetuning, online training via action-level reward, [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of trajectory length. tion Trajectory Annotation, 2) Single-step GRPO training with action-level rewards to enhance the format compliance and click accuracy, and 3) Online training with task-level rewards based on multi-turn trajectories to improve the gen￾eralization ability. Preliminary Task Definition Mobile agents are driven by textual in￾structions, understand screenshots of mobile screen… view at source ↗
Figure 5
Figure 5. Figure 5: Reward score during Stage3 training [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of reasoning trajectories between Mobile-R1 and Qwen2.5-VL-3B-Instruct on the task “ [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Pass K performance of Mobile-R1. To further assess the generalization capabilities of Mobile-R1, we validated our method’s impact on end-to￾end task performance by evaluating the success rate across 50 random tasks on physical devices. As detailed in Fiure 8, the model exhibited a consistent improvement trajectory. A minor initial dip in performance was observed during Stage 1, attributable to the action-l… view at source ↗
Figure 9
Figure 9. Figure 9: Prompt for Instruction Generation. Prompt for Agent Exploration You are a mobile GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. You are provided with function signatures within <tools></tools> XML tags: <tools> { "name": "mobile_use", "arguments": { "type": "function", "function": { "name_for_human": "mobile_use", "name":… view at source ↗
Figure 10
Figure 10. Figure 10: A.4 Prompts of Training System This prompt was designed to ensure consistency with the Qwen2.5-VL’s GUI prompt. The bond indicates the instruc￾tion, which can be replaced by any instruction of our dataset. Prompts of Training System You are a mobile GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. You are provided with fun… view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of the thinking processes of the Stage2 model and Stage3 model of Mobile-R1 under the same task [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Robustness Analysis on Unseen Applications. [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Robustness Analysis on Unseen Applications. [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative result of Mobile-R1 under the task “ [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Qualitative result of Mobile-R1 under the task “ [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗
read the original abstract

Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches centers on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present \textbf{Mobile-R1}, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training with realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the ``Eureka'' moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weight, and codes: https://mobile-r1.github.io/Mobile-R1/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Mobile-R1, a hierarchical training recipe for VLM-based mobile agents consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training with realistic environment feedback. It claims this curriculum addresses limitations of offline training and local action-level rewards (local optima, sparse GUI interactions causing convergence issues) and unlocks improved exploration and self-correction ('Eureka' moments). The work also contributes a Chinese mobile GUI dataset covering 28 applications with 24,521 high-quality manual annotations and establishes a benchmark with 500 trajectories, with all resources to be open-sourced.

Significance. If the empirical results and ablations confirm that the three-stage curriculum produces measurably superior exploration and error-correction behavior without introducing instability or dataset-specific overfitting, the work would provide a practical, reproducible training recipe for interactive mobile agents and help address data scarcity in non-English GUI settings. The open-sourcing of the dataset, benchmark, model weights, and code would further strengthen its utility for the community.

major comments (3)
  1. [Abstract and method description] The central claim that the hierarchical strategy 'effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction' is load-bearing yet unsupported by any quantitative results, ablation studies isolating stage 3, training dynamics, variance metrics, or error analysis on the 500-trajectory benchmark. The abstract and method description provide no evidence that multi-turn task-level training reliably avoids the convergence difficulties or overfitting risks noted for direct task-level rewards.
  2. [Introduction and proposed approach] The assertion that 'directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions' is presented without supporting data such as learning curves, success rates across stages, or comparisons to prior offline/local-reward baselines. This leaves the motivation for the specific three-stage ordering unverified.
  3. [Experiments and evaluation] No metrics are reported on self-correction frequency, exploration efficiency, or robustness to the 24,521 Chinese annotations that would demonstrate stage 3's contribution to 'Eureka' moments or rule out new convergence failures.
minor comments (2)
  1. [Method] A diagram or pseudocode outlining the transitions, reward signals, and data flow between the three stages would improve clarity of the curriculum.
  2. [Implementation details] The manuscript should explicitly state the base VLM, exact hyperparameters for each stage, and how on-policy exploration is implemented to allow reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of our hierarchical training recipe and the Chinese GUI dataset. We agree that the central claims regarding improved exploration and self-correction require stronger quantitative support, and we will revise the manuscript accordingly to address the identified gaps in evidence.

read point-by-point responses
  1. Referee: [Abstract and method description] The central claim that the hierarchical strategy 'effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction' is load-bearing yet unsupported by any quantitative results, ablation studies isolating stage 3, training dynamics, variance metrics, or error analysis on the 500-trajectory benchmark. The abstract and method description provide no evidence that multi-turn task-level training reliably avoids the convergence difficulties or overfitting risks noted for direct task-level rewards.

    Authors: We acknowledge that the abstract and method sections would benefit from more explicit quantitative backing for the load-bearing claims. While the full paper reports overall benchmark improvements from the full curriculum, we agree that isolating stage 3's contribution and providing supporting analyses are necessary. In the revision we will add: (i) ablation studies comparing the three-stage curriculum against a two-stage variant (omitting multi-turn task-level training), (ii) training dynamics and variance metrics across multiple runs on the 500-trajectory benchmark, and (iii) an error analysis quantifying self-correction events. We will also include a brief discussion of observed convergence behavior under direct task-level rewards versus the staged approach. revision: yes

  2. Referee: [Introduction and proposed approach] The assertion that 'directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions' is presented without supporting data such as learning curves, success rates across stages, or comparisons to prior offline/local-reward baselines. This leaves the motivation for the specific three-stage ordering unverified.

    Authors: We agree that empirical motivation for the three-stage ordering should be strengthened with data. In the revised manuscript we will insert learning curves and per-stage success rates that compare direct task-level reward training against our staged curriculum, together with comparisons to the offline and local-reward baselines used in prior work. These additions will directly illustrate the convergence issues that motivated the hierarchical design. revision: yes

  3. Referee: [Experiments and evaluation] No metrics are reported on self-correction frequency, exploration efficiency, or robustness to the 24,521 Chinese annotations that would demonstrate stage 3's contribution to 'Eureka' moments or rule out new convergence failures.

    Authors: We thank the referee for highlighting this omission. The current evaluation emphasizes aggregate task success; we will augment the Experiments section with quantitative metrics including self-correction frequency (fraction of trajectories exhibiting successful recovery after an error), exploration efficiency (unique GUI states or actions visited per episode), and an analysis of performance robustness across the 24,521 Chinese annotations. These metrics will be reported both for the full curriculum and for ablated variants to isolate stage 3's role. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical training recipe with external data

full rationale

The paper describes a three-stage hierarchical curriculum (format alignment, on-policy verifiable-action exploration, multi-turn task-level training) as an empirical training strategy that relies on external environment feedback, a newly contributed dataset of 24,521 Chinese GUI annotations across 28 apps, and a 500-trajectory benchmark. No equations, fitted parameters, or performance metrics are shown to reduce by construction to the target results; the central claim about bootstrapping exploration and self-correction is presented as an outcome of the training process evaluated on independent resources rather than a self-referential derivation. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical training recipe that relies on standard reinforcement learning assumptions about environment feedback and reward sparsity; no new mathematical axioms or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5821 in / 1193 out tokens · 34430 ms · 2026-05-19T08:22:14.243259+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

    cs.AI 2026-02 unverdicted novelty 7.0

    The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.

  2. InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

    cs.AI 2025-08 unverdicted novelty 5.0

    InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench benchmark.

  3. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 3 Pith papers · 12 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Chen, L.; Li, L.; Zhao, H.; Song, Y .; and Vinci. 2025a. R1-V: Reinforcing Super Generalization Ability in Vision- Language Models with Less than $3. Accessed: 2025-02-

  2. [2]

    Chen, L.; Li, L.; Zhao, H.; Song, Y .; Vinci; Kong, L.; Liu, Q.; and Chang, B. 2025b. RLVR in Vision Language Mod- els: Findings, Questions and Directions. Notion Post. Chen, P.; Bu, P.; Wang, Y .; Wang, X.; Wang, Z.; Guo, J.; Zhao, Y .; Zhu, Q.; Song, J.; Yang, S.; Wang, J.; and Zheng, B. 2025c. CombatVLA: An Efficient Vision-Language- Action Model for C...

  3. [3]

    arXiv:2502.11718

    ”See the World, Discover Knowledge”: A Chinese Factuality Evaluation for Large Vision Language Models. arXiv:2502.11718. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Huang, X.; Liu, W.; Chen, X.; Wang, X.; Wang, H.; Lian, D.; Wang, Y .; Tang, R.; and Chen, E

  5. [5]

    Understanding the planning of LLM agents: A survey

    Understand- ing the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716. Li, J.; and Huang, K

  6. [6]

    A summary on gui agents with foundation models enhanced by reinforcement learning,

    A Summary on GUI Agents with Foundation Models Enhanced by Reinforcement Learning. arXiv preprint arXiv:2504.20464. Li, W.; Bishop, W.; Li, A.; Rawles, C.; Campbell-Ajala, F.; Tyamagundlu, D.; and Riva, O. 2024a. On the effects of data scale on computer control agents. arXiv e-prints , arXiv–

  7. [7]

    E.; Li, A.; Rawles, C.; Campbell-Ajala, F.; Tyamagundlu, D.; and Riva, O

    Li, W.; Bishop, W. E.; Li, A.; Rawles, C.; Campbell-Ajala, F.; Tyamagundlu, D.; and Riva, O. 2024b. On the effects of data scale on ui control agents. Advances in Neural Infor- mation Processing Systems, 37: 92130–92154. Li, Y .; Zhang, C.; Yang, W.; Fu, B.; Cheng, P.; Chen, X.; Chen, L.; and Wei, Y . 2024c. Appagent v2: Ad- vanced agent for flexible mobi...

  8. [8]

    InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    Showui: One vision- language-action model for gui visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, 19498–19508. Liu, Y .; Li, P.; Xie, C.; Hu, X.; Han, X.; Zhang, S.; Yang, H.; and Wu, F. 2025a. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reason- ers. arXiv:2504.14239. Liu, Z.; Su...

  9. [9]

    arXiv preprint arXiv:2406.08451

    Gui odyssey: A comprehensive dataset for cross-app gui navigation on mo- bile devices. arXiv preprint arXiv:2406.08451. Lu, Z.; Chai, Y .; Guo, Y .; Yin, X.; Liu, L.; Wang, H.; Xiao, H.; Ren, S.; Xiong, G.; and Li, H. 2025b. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620. Luo, R.; Wang, L.; He, ...

  10. [10]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv:2504.10458. Nguyen, D.; Chen, J.; Wang, Y .; Wu, G.; Park, N.; Hu, Z.; Lyu, H.; Wu, J.; Aponte, R.; Xia, Y .; et al

  11. [11]

    Gui agents: A survey,

    Gui agents: A survey. arXiv preprint arXiv:2412.13501. OpenAI, R

  12. [12]

    GPT-4 Technical Report

    Gpt-4 technical report. arxiv 2303.08774. View in Article, 2(5). Peng, Y .; Wang, X.; Wei, Y .; Pei, J.; Qiu, W.; Jian, A.; Hao, Y .; Pan, J.; Xie, T.; Ge, L.; et al

  13. [13]

    arXiv preprint arXiv:2504.05599

    Skywork r1v: Pio- neering multimodal reasoning with chain-of-thought. arXiv preprint arXiv:2504.05599. Putta, P.; Mills, E.; Garg, N.; Motwani, S.; Finn, C.; Garg, D.; and Rafailov, R

  14. [14]

    Agent q: Advanced reasoning and learning for autonomous ai agents

    Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv:2408.07199. Qin, Y .; Ye, Y .; Fang, J.; Wang, H.; Liang, S.; Tian, S.; Zhang, J.; Li, J.; Li, Y .; Huang, S.; et al

  15. [15]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326. Shao, Y .; Li, L.; Dai, J.; and Qiu, X

  16. [16]

    arXiv preprint arXiv:2310.10158

    Character- llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y .; Wu, Y .; et al

  17. [17]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open lan- guage models. arXiv preprint arXiv:2402.03300. Sun, Q.; Cheng, K.; Ding, Z.; Jin, C.; Wang, Y .; Xu, F.; Wu, Z.; Jia, C.; Chen, L.; Liu, Z.; et al

  18. [18]

    Os-genesis: Automating gui agent trajectory construction via reverse task synthesis

    OS-Genesis: Au- tomating GUI Agent Trajectory Construction via Reverse Task Synthesis. arXiv preprint arXiv:2412.19723. Wang, J.; Xu, H.; Jia, H.; Zhang, X.; Yan, M.; Shen, W.; Zhang, J.; Huang, F.; and Sang, J. 2024a. Mobile- agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.010...

  19. [19]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks

    Mobile-Agent-E: Self- Evolving Mobile Assistant for Complex Tasks. arXiv preprint arXiv:2501.11733. Wanyan, Y .; Zhang, X.; Xu, H.; and et al

  20. [20]

    arXiv preprint arXiv:2506.04614

    Look Be- fore You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation. arXiv preprint arXiv:2506.04614. Wu, Z.; Wu, Z.; Xu, F.; Wang, Y .; Sun, Q.; Jia, C.; Cheng, K.; Ding, Z.; Chen, L.; Liang, P. P.; et al

  21. [21]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218. Xu, Y .; Wang, Z.; Wang, J.; Lu, D.; Xie, T.; Saha, A.; Sa- hoo, D.; Yu, T.; and Xiong, C

  22. [22]

    Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454. Yang, Y .; He, X.; Pan, H.; Jiang, X.; Deng, Y .; Yang, X.; Lu, H.; Yin, D.; Rao, F.; Zhu, M.; et al

  23. [23]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    R1-onevision: Advancing generalized multimodal reasoning through cross- modal formalization. arXiv preprint arXiv:2503.10615. Yuan, S.; Song, K.; Chen, J.; Tan, X.; Shen, Y .; Kan, R.; Li, D.; and Yang, D

  24. [24]

    arXiv preprint arXiv:2401.06201

    Easytool: Enhancing llm- based agents with concise tool instruction. arXiv preprint arXiv:2401.06201. Zhang, C.; Yang, Z.; Liu, J.; Li, Y .; Han, Y .; Chen, X.; Huang, Z.; Fu, B.; and Yu, G. 2025a. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–20. Zhang, Z.; Lu, Y .; Fu,...

  25. [25]

    R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

    R1-Zero’s” Aha Moment” in Visual Reasoning on a 2B Non-SFT Model. arXiv preprint arXiv:2503.05132. Limitation From the perspective of training strategy, this paper em- ploys both action-level and task-level rewards to guide the RL training process, allowing the agent to gradually improve its capabilities. However, exploring how to enhance perfor- mance us...

  26. [26]

    Followed

    Specifically, we tracked the model’s tail success ratio as a function of the training steps. As illustrated in Figuree 13, the tail success ratio exhibits a consistent upward trend with the increase in steps. This finding suggests that a more prolonged phase of end-to-end exploration is beneficial for achieving final task success. Figure 13: Robustness An...