MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization
Pith reviewed 2026-06-26 15:56 UTC · model grok-4.3
The pith
MobileForge adapts mobile GUI agents to new apps using only automatically generated data and hierarchical feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileForge is an annotation-free adaptation system for mobile GUI agents. It consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated adaptation data, it adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, and the adapted ForgeOwl-8B reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split.
What carries the argument
Hierarchical Feedback-Guided Policy Optimization (HiFPO), which converts automatically generated trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates.
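To make the GRPO part concrete, here is a minimal sketch of the group-relative advantage normalization that GRPO is built on, applied to step-level rewards. It is not the paper's HiFPO implementation: the function names, the 0/1 reward encoding, and the blend weight `w_outcome` are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group (GRPO-style).

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_reward(outcome, step_ok, w_outcome=0.5):
    """Hypothetical blend of trajectory outcome and step-level feedback.

    outcome and step_ok are 0/1 flags; w_outcome is an assumed weight,
    not a value reported by the paper.
    """
    return w_outcome * outcome + (1.0 - w_outcome) * step_ok


# One group: candidate actions sampled for the same step and hint context.
candidates = [
    {"outcome": 1, "step_ok": 1},
    {"outcome": 1, "step_ok": 0},
    {"outcome": 0, "step_ok": 1},
    {"outcome": 0, "step_ok": 0},
]
rewards = [step_reward(c["outcome"], c["step_ok"]) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a candidate is reinforced only when it beats the other samples drawn for the same step, which is why the reliability of the automatic step-level feedback matters so much to the argument.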
If this is right
- A base MLLM can come close to a closed-data GUI-specialized model on AndroidWorld using only automatic adaptation data: the adapted Qwen3-VL-8B reaches 67.2% Pass@3 versus 69.0% for GUI-Owl-1.5-8B.
- The adapted ForgeOwl-8B achieves 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split.
- Adaptation becomes feasible for many apps because MobileGym supplies the full substrate of task generation, rollout, and feedback without manual labels.
- Policy optimization shifts from isolated rollouts and coarse rewards to hint-contextualized step-level GRPO updates.
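The headline numbers above are Pass@3 scores. As a rough sketch, assuming Pass@k here means a task counts as solved when at least one of k attempts succeeds, the metric can be computed as follows. The task names and attempt outcomes are made up for illustration and are not results from the paper.

```python
def pass_at_k(attempts_by_task, k=3):
    """Fraction of tasks with at least one success among the first k attempts."""
    solved = sum(1 for attempts in attempts_by_task.values() if any(attempts[:k]))
    return solved / len(attempts_by_task)


# Hypothetical per-task attempt outcomes (True = task completed).
results = {
    "add_contact": [False, True, False],
    "set_alarm": [False, False, False],
    "send_sms": [True, True, True],
    "rename_file": [False, False, True],
}
score_at_3 = pass_at_k(results, k=3)  # 3 of 4 tasks solved -> 0.75
score_at_1 = pass_at_k(results, k=1)  # 1 of 4 tasks solved -> 0.25
```

The gap between the two scores in this toy example shows why Pass@3 and single-attempt success rates should not be compared directly.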
Where Pith is reading between the lines
- The same automatic feedback loop could be tested on non-mobile GUI environments or web agents where app state is similarly observable.
- If step-level hints prove reliable, the approach might reduce reliance on expensive human preference data in other agent training pipelines.
- Scaling MobileGym across more apps could create large open adaptation datasets that further close the gap to closed models.
Load-bearing premise
Trajectory outcomes, step-level process feedback, and corrective hints can be generated automatically inside MobileGym and converted into reliable step-level GRPO improvement signals without human-written tasks, demonstrations, or reward labels.
What would settle it
An experiment that replaces MobileGym's automatic feedback with equivalent human-written step-level hints, then compares the policy gains from GRPO updates on the same base model and test split.
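One way such a comparison could be scored is a paired bootstrap over tasks, sketched below. This is an assumed analysis, not one the paper or the review specifies; the function name, the 95% percentile interval, and the per-task outcomes are all hypothetical.

```python
import random


def paired_bootstrap_diff(auto, human, n_resamples=2000, seed=0):
    """Bootstrap the difference in success rate (auto minus human).

    auto and human are per-task 0/1 outcomes on the same test tasks,
    in the same order. Returns (point estimate, lower, upper) using a
    95% percentile interval.
    """
    assert len(auto) == len(human)
    rng = random.Random(seed)
    n = len(auto)
    diffs = []
    for _ in range(n_resamples):
        # Resample task indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(auto[i] - human[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    point = (sum(auto) - sum(human)) / n
    return point, lower, upper


# Made-up outcomes for ten tasks under each feedback source.
auto_feedback = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
human_hints = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
point, lower, upper = paired_bootstrap_diff(auto_feedback, human_hints)
```

If the interval comfortably contains zero on a realistic test split, the automatic feedback would look roughly as useful as human hints; an interval well below zero would point at the load-bearing premise.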
MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing annotation-free GUI learning reduces manual supervision, yet lacks a unified substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, while policy optimization often relies on isolated rollouts and coarse rewards that are hard to convert into reliable improvement signals. We present MobileForge, an annotation-free adaptation system for mobile GUI agents. MobileForge consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated annotation-free adaptation data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, close to the closed-data GUI-specialized GUI-Owl-1.5-8B base model at 69.0%. The MobileForge-adapted ForgeOwl-8B further reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split, establishing the strongest open-data mobile GUI agent in our evaluation. Code, data, and trained models will be released at https://mobile-forge.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MobileForge, an annotation-free adaptation system for MLLM-based mobile GUI agents. It consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interactions, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which converts automatically generated trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only such data, the system adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld (close to the 69.0% of the closed-data GUI-Owl-1.5-8B), with the further-adapted ForgeOwl-8B reaching 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split.
Significance. If the automatic feedback mechanisms prove reliable and the performance gains are reproducible, the work would be a meaningful advance in scalable GUI agent adaptation by eliminating the need for human-written tasks, demonstrations, or reward labels across numerous and frequently updated apps. The planned public release of code, data, and trained models is a clear strength that supports verification and extension by the community.
major comments (2)
- [Method] The method overview provides no concrete description or pseudocode for how trajectory outcomes, step-level process feedback, and corrective hints are automatically extracted inside MobileGym and converted into reliable GRPO improvement signals; this mechanism is load-bearing for the central annotation-free claim and the reported Pass@3 numbers.
- [Experiments] No details are given on the GRPO update implementation, hyperparameter choices for HiFPO, or experimental controls (e.g., data generation protocol, ablation of feedback types); without these, the support for the performance claims on AndroidWorld and MobileWorld cannot be verified.
Simulated Author's Rebuttal
We thank the referee for their constructive review. The comments correctly identify areas where the manuscript would benefit from expanded technical detail to support verification of the annotation-free claims. We address each point below and will revise the manuscript accordingly.
- Referee: [Method] The method overview provides no concrete description or pseudocode for how trajectory outcomes, step-level process feedback, and corrective hints are automatically extracted inside MobileGym and converted into reliable GRPO improvement signals; this mechanism is load-bearing for the central annotation-free claim and the reported Pass@3 numbers.
Authors: We agree that the current description in Section 3 is high-level and lacks the requested concrete details and pseudocode. In the revision we will add an expanded subsection with (1) the precise algorithms used inside MobileGym to derive trajectory outcomes, step-level process feedback, and corrective hints from real app interactions, and (2) pseudocode showing how these signals are formatted into hint-contextualized step-level GRPO updates. This will make the load-bearing annotation-free pipeline explicit.
Revision: yes
- Referee: [Experiments] No details are given on the GRPO update implementation, hyperparameter choices for HiFPO, or experimental controls (e.g., data generation protocol, ablation of feedback types); without these, the support for the performance claims on AndroidWorld and MobileWorld cannot be verified.
Authors: We concur that additional implementation and control details are needed for reproducibility. The revised manuscript will include a new experimental appendix or subsection specifying the exact GRPO formulation and update rule, all HiFPO hyperparameters (learning rate, KL coefficient, feedback weighting, etc.), the full data-generation protocol, and ablation results isolating each feedback type. These additions will directly support the AndroidWorld and MobileWorld numbers.
Revision: yes
Circularity Check
No significant circularity
The paper presents an empirical system (MobileGym + HiFPO) whose central claims are measured Pass@3 and success rates on external benchmarks (AndroidWorld, MobileWorld). No equations, derivations, or self-citations are shown to reduce these reported outcomes to quantities fitted inside the same loop or to rename inputs as predictions; the adaptation results are presented as measured consequences of the described process on held-out tasks.
Axiom & Free-Parameter Ledger
free parameters (1)
- HiFPO and GRPO hyperparameters
axioms (1)
- domain assumption: Automatically generated step-level feedback and corrective hints constitute valid training signals for GRPO updates in GUI tasks.
invented entities (2)
- MobileGym: no independent evidence
- HiFPO: no independent evidence