Xiaomi-GUI-0 Technical Report
Pith reviewed 2026-07-01 06:05 UTC · model grok-4.3
The pith
A real-device-dominant hybrid infrastructure lets a multimodal GUI agent reach 72.0% success on the in-house RealMobile benchmark while raising execution stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Xiaomi-GUI-0 is a multimodal GUI agent whose defining feature is a real-device-dominant hybrid infrastructure that keeps physical phones as the primary execution environment while using sandboxes only for auxiliary support. This infrastructure ensures that data collection, rollout, and evaluation share a state distribution close to real deployment. The model is trained on multi-source trajectories augmented by an error-driven data flywheel that converts failure traces into corrected actions, reflective explanations, and recovery demonstrations, then refined through a progressive pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning.
What carries the argument
A real-device-dominant hybrid infrastructure that makes physical phones the primary execution environment and uses sandboxes only as auxiliary support, so that training and evaluation distributions match real deployment.
If this is right
- Higher execution stability when the agent encounters permission dialogs, payment flows, and risk controls in live applications.
- Continuous improvement loop in which real failure trajectories are automatically turned into reflective training data without additional human labeling.
- Better recognition and recovery from abnormal states through the combination of reflection data and agentic reinforcement learning.
- Gains observed on both public benchmarks and the in-house RealMobile set indicate that aligning execution distribution reduces the benchmark-to-reality gap.
Where Pith is reading between the lines
- The same hybrid setup could be adapted to other platforms such as tablets or desktop environments if physical devices remain the primary source of state variation.
- Reducing reliance on purely simulated environments may lower the cost of developing future GUI agents while increasing their robustness to live variability.
- Incorporating user-specific account states during the data flywheel stage could further narrow the gap between training and personalized deployment.
- The three-stage training progression may generalize to other agentic tasks where reflection and recovery are critical.
Load-bearing premise
The state distribution produced by physical phones as the main execution environment is close enough to live deployment that benchmark improvements will carry over to production use with varying accounts, permissions, and risk controls.
What would settle it
A substantial drop in task success rate when the agent is run on production devices that include diverse user accounts, active permission dialogs, and risk-control screens absent from the training distribution.
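One way to make "a substantial drop" concrete is a two-proportion z-test on success counts from a benchmark run and a production run. The sketch below is only an illustration and is not taken from the paper: the function name `two_proportion_z` and the counts in the usage example are hypothetical.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (difference, z, two-sided p-value) for H0: p_a == p_b."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both runs need at least one episode")
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


if __name__ == "__main__":
    # Hypothetical counts, NOT results from the paper.
    diff, z, p = two_proportion_z(144, 200, 120, 200)
    print(f"drop={diff:.3f} z={z:.2f} p={p:.4f}")
```

A small p-value would only show that the two rates differ; whether the drop counts as "substantial" still depends on a threshold chosen in advance.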
Original abstract
Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments. It describes a real-device-dominant hybrid infrastructure (physical phones primary, sandboxes auxiliary) for data collection, training, rollout, and evaluation; multi-source training data spanning head tasks, long-tail intents, and capability-enhancement data; an error-driven data flywheel that converts failure trajectories into corrected actions and reflective explanations; and a progressive three-stage training pipeline (supervised fine-tuning, step-level reinforcement learning, agentic reinforcement learning). The central empirical claim is that the resulting model achieves 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld while substantially improving execution stability and abnormal-state recognition.
Significance. If the performance claims hold under rigorous controls, the work would be significant for GUI agent research by demonstrating a closed-loop system whose execution distribution is intended to match real deployment more closely than offline or simulated benchmarks, potentially narrowing the persistent gap between benchmark scores and practical usability.
major comments (3)
- [Evaluation] Evaluation section: success rates of 72.0% on RealMobile and 78.9% on AndroidWorld are stated without baselines, error bars, dataset sizes, number of evaluation episodes, exclusion criteria, or statistical tests, rendering the central performance claim unsupported by evidence in the text.
- [Infrastructure] Infrastructure section: the claim that the real-device-dominant hybrid infrastructure produces a state distribution close enough to real deployment (including account states, permission dialogs, payment flows, and risk controls) is load-bearing for the reported gains but is asserted without quantitative validation such as distribution statistics, KL divergence, or ablation isolating the hybrid component.
- [Results] Results and abstract: no quantitative metrics or measurement protocol are supplied for the claimed improvements in 'execution stability' and 'abnormal-state recognition,' preventing assessment of whether these are meaningful or reproducible.
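The error bars requested in the first comment could, for a single success rate, be reported as a Wilson score interval. The following is a minimal sketch under stated assumptions: the helper name `wilson_interval` is hypothetical, and the episode counts in the usage example are placeholders, since the report does not state how many episodes were run.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Placeholder counts, NOT taken from the paper.
    low, high = wilson_interval(successes=50, trials=100)
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of episodes grows, which is why the episode count matters for judging the headline figures.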
minor comments (2)
- [Abstract] The term 'RealMobile' is introduced without an explicit definition or pointer to its construction details.
- [Abstract] The phrase 'substantially improving' is used without accompanying numbers or comparison tables.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where the manuscript will be revised to strengthen the evidence presented.
- Referee: [Evaluation] Evaluation section: success rates of 72.0% on RealMobile and 78.9% on AndroidWorld are stated without baselines, error bars, dataset sizes, number of evaluation episodes, exclusion criteria, or statistical tests, rendering the central performance claim unsupported by evidence in the text.
  Authors: We agree that the evaluation section requires additional supporting details. In the revised manuscript we will add comparisons against published baselines on AndroidWorld, report the number of evaluation episodes and task categories for both benchmarks, include error bars from repeated runs where available, describe exclusion criteria for RealMobile, and report statistical significance where the data permit. For the proprietary RealMobile benchmark we will provide summarized rather than exhaustive episode-level statistics.
  Revision: partial
- Referee: [Infrastructure] Infrastructure section: the claim that the real-device-dominant hybrid infrastructure produces a state distribution close enough to real deployment (including account states, permission dialogs, payment flows, and risk controls) is load-bearing for the reported gains but is asserted without quantitative validation such as distribution statistics, KL divergence, or ablation isolating the hybrid component.
  Authors: The hybrid infrastructure is central to the work. We will revise the infrastructure section to include quantitative state-distribution statistics (e.g., frequency of permission dialogs and abnormal states) comparing the real-device-dominant setup against sandbox-only runs. While full KL divergence on high-dimensional GUI states is impractical, we will add an ablation that isolates the contribution of real-device data to the final performance where feasible.
  Revision: yes
- Referee: [Results] Results and abstract: no quantitative metrics or measurement protocol are supplied for the claimed improvements in 'execution stability' and 'abnormal-state recognition,' preventing assessment of whether these are meaningful or reproducible.
  Authors: We agree that quantitative metrics are needed. In the revision we will define and report concrete metrics for execution stability (e.g., recovery success rate across account-state variations) and abnormal-state recognition (e.g., precision of error detection), together with the exact measurement protocols used during evaluation.
  Revision: yes
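The state-distribution statistics mentioned in the second response can be approximated over coarse event categories rather than raw GUI states. The sketch below is an assumption-laden illustration, not the authors' method: it computes a smoothed KL divergence between event-type frequencies from two logs, and the category labels and function names are hypothetical.

```python
import math
from collections import Counter


def smoothed_distribution(events, categories, alpha=1.0):
    """Relative frequency of each category with add-alpha smoothing."""
    counts = Counter(events)
    # Only events that fall in a known category contribute to the total.
    total = sum(counts[c] for c in categories) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same categories."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


if __name__ == "__main__":
    # Hypothetical category labels and logs, NOT data from the paper.
    categories = ["normal", "permission_dialog", "payment_auth", "risk_control"]
    real_device = ["normal", "permission_dialog", "normal", "risk_control"]
    sandbox_only = ["normal", "normal", "normal", "normal"]
    p = smoothed_distribution(real_device, categories)
    q = smoothed_distribution(sandbox_only, categories)
    print(f"KL(real || sandbox) = {kl_divergence(p, q):.4f}")
```

Smoothing keeps the divergence finite when a category never appears in one log, which is the expected case for sandbox-only runs that lack risk-control screens.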
Circularity Check
No circularity; purely empirical system description with no derivations or fitted predictions
The paper is a technical report describing an empirical GUI agent system, hybrid infrastructure, multi-source data collection, error-driven flywheel, and three-stage training pipeline, followed by benchmark results (72.0% RealMobile, 78.9% AndroidWorld). No equations, first-principles derivations, parameter fitting, or predictions appear. No self-citations are load-bearing for any claimed result. The central claims rest on direct evaluation rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical reports.
Contributions and Acknowledgments
All contributors are listed in alphabetical order by their last names. Core Contr...
Agent prompt excerpt
- Current device type & foreground app
- Current screenshot
# Available Tools
You MUST pick exactly one tool per step. Output the corresponding JSON string inside `<tool_call>`.
- Tap: `{"name": "Tap", "position": [x, y], "times": 1}` (Tap at coordinate)
- LongPress: `{"name": "LongPress", "position": [x, y]}` (Trigger contextual menus)
- Swipe: `{"name": "Swipe", "start_position": [x1, y1], "end_position": [x2, y2]}` (Swipe to scroll/move. Swipe up to scroll down)
- Type: `{"name": "Type", "position": [x, y], "text": "..."}` (Tap input box and type)
- Search: `{"name": "Search", "position": [x, y], "text": "..."}` (Macro: tap -> clear -> type -> submit)
- Open: `{"name": "Open", "app": "..."}` (Launch app via system)
- Back: `{"name": "Back"}` (System-level back)
- Home: `{"name": "Home"}` (Go to home screen)
- Wait: `{"name": "Wait"}` (Wait for page loading/rendering)
- Request: `{"name": "Request", "text": "..."}` (Ask user for clarification/confirmation)
- Fail: `{"name": "Fail", "type": "...", "reason": "..."}` (Report failure. `<TYPE>` MUST be one of: LOGIN_REQUIRED, USE_GUIDANCE, CAPTCHA_VERIFICATION, RESULT_NOT_FOUND, BLUETOOTH_CONNECTION_REQUIRED, NETWORK_ERROR, PAYMENT_AUTHENTICATION, TASK_CANT_FULFILLED, REPEAT_OPERATION, PERMISSION_REQUEST, PASSWORD_REQUIRED, TAKEOVER_EXIT, TEMPORARY_TAKEOVER, MANUA...
- Complete: `{"name": "Complete"}` (Confirm goal reached for non-Q&A tasks)
- Speak: `{"name": "Speak", "text": "..."}` (Present final answer for Q&A tasks)
# Operational Constraints
- Coordinate system: every `position` is a relative [x, y] in [0, 1] with 3-decimal precision. Top-left is (0, 0); bottom-right is (1, 1).
- Dismiss unrelated pop-ups (ads, upgrade prompts, rating requests) by tapping their Close / Skip / X / "Later" button rather than calling Fail.
- Loop breaker: if three consecutive steps cause no visible change, or the same action is repeating in a loop, self-correct (try Back or a different target). If self-correction fails, call Fail.
# Reasoning Framework (inside <think>)
Before emitting the action, reason inside `<think>...</think>` (omit steps if no new info):
- [Observation]: Objectively describe the current App, page state, and key visible elements
- [Reflection]: (Optional) Include ONLY if the current screen deviates from the previous plan's expectation. Explain what was expected vs. what is actually seen.
- [Plan] / [Plan Update] / [Replan]: (Choose one). Output a 2-4 step path in a single line separated by `|`. Mark completed steps with `[done]` and the current step with `->`. Use [Replan] if the previous plan failed.
- [Decision]: Deduce the exact action based on the Observation and the current `->` step in the Plan.
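The tool schema and coordinate convention in the prompt can be checked mechanically before an action reaches a device. The sketch below is a hypothetical validator, not code from the paper. It assumes that every field shown in a tool's example is required, and that the relative point (1, 1) maps to the last pixel of the screen; `parse_tool_call` and `to_pixels` are illustrative names.

```python
import json

# Fields per tool, transcribed from the tool list in the prompt.
# Treating all of them as required is an assumption of this sketch.
TOOL_FIELDS = {
    "Tap": {"position", "times"},
    "LongPress": {"position"},
    "Swipe": {"start_position", "end_position"},
    "Type": {"position", "text"},
    "Search": {"position", "text"},
    "Open": {"app"},
    "Back": set(),
    "Home": set(),
    "Wait": set(),
    "Request": {"text"},
    "Fail": {"type", "reason"},
    "Complete": set(),
    "Speak": {"text"},
}
POSITION_KEYS = ("position", "start_position", "end_position")


def parse_tool_call(raw):
    """Parse one JSON tool call and check its name, fields, and coordinates."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in TOOL_FIELDS:
        raise ValueError(f"unknown tool: {name!r}")
    missing = TOOL_FIELDS[name] - set(call)
    if missing:
        raise ValueError(f"{name} is missing fields: {sorted(missing)}")
    for key in POSITION_KEYS:
        if key in call:
            x, y = call[key]
            # The prompt defines positions as relative values in [0, 1].
            if not (0 <= x <= 1 and 0 <= y <= 1):
                raise ValueError(f"{key} must lie in [0, 1]: {call[key]}")
    return call


def to_pixels(position, width, height):
    """Map a relative [x, y] to integer pixels; (1, 1) maps to the last pixel (assumed)."""
    x, y = position
    return round(x * (width - 1)), round(y * (height - 1))


if __name__ == "__main__":
    call = parse_tool_call('{"name": "Tap", "position": [0.5, 0.25], "times": 1}')
    print(call["name"], to_pixels(call["position"], width=1080, height=2400))
```

Rejecting malformed calls before execution keeps schema errors separate from genuine task failures, which matters when failure traces are later reused as training data.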