arxiv: 2512.12634 · v3 · pith:QHECGYVLnew · submitted 2025-12-14 · 💻 cs.AI

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Pith reviewed 2026-05-16 22:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords mobile GUI agents · benchmarking framework · offline evaluation · multi-path annotations · modular analysis · human agreement · AI agent evaluation

The pith

MobiBench provides a modular offline benchmark for mobile GUI agents that matches human evaluators at 94.72 percent agreement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluation methods for mobile GUI agents either use single-path offline datasets that penalize valid alternative actions or rely on live online tests that lack scalability and reproducibility. MobiBench introduces multi-path annotations and a modular structure to overcome both issues in a fully offline setting. The framework decomposes agents into components for detailed analysis while preserving high agreement with human judgments. Experiments confirm it reaches 94.72 percent agreement, comparable to engineered online benchmarks, and surfaces insights on techniques, model scales, and design guidelines.
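
To make the single-path penalty concrete, the following minimal Python sketch contrasts the two scoring rules on one step. The `Action` type, the element identifiers, and both function names are illustrative assumptions; they are not MobiBench's actual annotation schema or API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Action:
    """Hypothetical representation of one GUI action."""
    kind: str        # e.g. "tap", "type", "scroll"
    target: str      # hypothetical UI element identifier
    text: str = ""   # payload for "type" actions


def single_path_correct(predicted: Action, reference: Action) -> bool:
    # Single-path scoring: only the one annotated action counts.
    return predicted == reference


def multi_path_correct(predicted: Action, valid: FrozenSet[Action]) -> bool:
    # Multi-path scoring: any annotated valid alternative counts.
    return predicted in valid


# Assumed screen where search can be opened from an icon or from a search bar.
reference = Action("tap", "search_icon")
valid = frozenset({Action("tap", "search_icon"), Action("tap", "search_bar")})
predicted = Action("tap", "search_bar")

single = single_path_correct(predicted, reference)  # penalized
multi = multi_path_correct(predicted, valid)        # accepted
```

Under these assumptions the same prediction is rejected by the single-path rule and accepted by the multi-path rule, which is the kind of discrepancy the summary describes.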

Core claim

MobiBench is the first modular and multi-path aware offline benchmarking framework for mobile GUI agents. It achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while retaining the scalability and reproducibility of static offline benchmarks. It also supports module-level analysis of agent performance.

What carries the argument

Multi-branch annotations paired with modular decomposition of agent pipelines that separate perception, reasoning, and action modules for independent scoring.
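
One way to picture independent module scoring is sketched below in Python. This summary does not describe the paper's real module interfaces, so the three callables, the sample fields, and the toy modules are all hypothetical. The idea shown is to feed each module the gold output of the stage before it, so that an upstream error does not hide a downstream module's behaviour.

```python
from typing import Callable, Dict

# Hypothetical module signatures (assumptions, not the paper's interfaces).
Perceive = Callable[[str], str]      # raw screen -> screen description
Reason = Callable[[str, str], str]   # (task, description) -> intent
Act = Callable[[str], str]           # intent -> concrete action


def run_agent(task: str, screen: str,
              perceive: Perceive, reason: Reason, act: Act) -> str:
    """End-to-end run: each module consumes the previous module's output."""
    return act(reason(task, perceive(screen)))


def score_modules(sample: Dict[str, str],
                  perceive: Perceive, reason: Reason, act: Act) -> Dict[str, bool]:
    """Score each module on gold upstream inputs so errors do not cascade."""
    return {
        "perception": perceive(sample["screen"]) == sample["gold_description"],
        "reasoning": reason(sample["task"], sample["gold_description"])
        == sample["gold_intent"],
        "action": act(sample["gold_intent"]) == sample["gold_action"],
    }


# Toy sample and toy modules; the reasoning module is deliberately wrong.
sample = {
    "task": "open search",
    "screen": "<button text='Search'/>",
    "gold_description": "button:search",
    "gold_intent": "tap search",
    "gold_action": "tap(button:search)",
}
perceive = lambda screen: "button:search" if "Search" in screen else "unknown"
reason = lambda task, description: "tap settings"
act = lambda intent: "tap(button:" + intent.split()[-1] + ")"

scores = score_modules(sample, perceive, reason, act)
end_to_end_ok = (
    run_agent(sample["task"], sample["screen"], perceive, reason, act)
    == sample["gold_action"]
)
```

In this toy setup the end-to-end run fails, while the per-module scores point to reasoning as the only failing stage.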

If this is right

  • Different agent techniques can be compared fairly without penalizing valid alternative paths.
  • Performance bottlenecks can be isolated to specific modules such as perception or planning.
  • Optimal module configurations can be identified across different model sizes.
  • Actionable guidelines emerge for building more capable and cost-efficient mobile GUI agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-path and modular structure could transfer to benchmarking GUI agents on web or desktop platforms.
  • Richer multi-path data might serve as improved training signals for agent models.
  • Widespread adoption would lower the cost and time of reliable agent evaluation, speeding iteration cycles.

Load-bearing premise

The multi-path annotations capture all valid alternative actions that human evaluators would accept without systematic omissions.

What would settle it

A direct comparison study that collects fresh human ratings on a held-out set of agent trajectories and measures whether MobiBench scores still reach at least 90 percent agreement.
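
A check of this kind could be computed as simple percent agreement between benchmark verdicts and fresh human verdicts. The Python sketch below assumes one boolean success verdict per trajectory from each source; the function name and the toy data are hypothetical, and the paper may define agreement differently.

```python
from typing import Sequence


def agreement_rate(benchmark: Sequence[bool], human: Sequence[bool]) -> float:
    """Percentage of trajectories where the two verdicts coincide."""
    if len(benchmark) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not benchmark:
        raise ValueError("need at least one trajectory")
    matches = sum(b == h for b, h in zip(benchmark, human))
    return 100.0 * matches / len(benchmark)


# Toy held-out set of ten trajectories; the verdicts differ on one of them.
benchmark = [True, True, False, True, False, True, True, False, True, True]
human = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(benchmark, human)
meets_threshold = rate >= 90.0  # the 90 percent bar named above
```

Raw percent agreement does not correct for chance agreement, so a real study might also report a chance-corrected statistic.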

Figures

Figures reproduced from arXiv: 2512.12634 by Byeongung Jo, Insik Shin, Jaeyoung Wi, Joo Hyung Lee, Sangeun Oh, Seungwoo Baek, Sunjae Lee, Tae Hoon Min, Youngmin Im.

Figure 1: Examples of single-path static dataset and multi-branch static dataset.
Figure 2: Modular architecture of Mobile GUI Agents
Figure 3: Cost efficiency of different module combina…
Figure 4: Efficiency of latency incurring techniques
Figure 5: Correlation between screen complexity and path diversity
Figure 6: Example techniques for screen parsing.
B.1 Input, B.1.1 A11y Tree. Android’s Accessibility framework enables extraction of the on-screen UI hierarchy by producing XML dumps that encode view structure, attributes, and interaction affordances. We use this mechanism to collect UI snapshots offline and construct a dataset containing serialized UI trees for each interaction state. During evaluation, the agent op…
Original abstract

Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely on either single-path offline benchmarks or online live benchmarks. Offline benchmarks using static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi-path-aware offline benchmarking framework for mobile GUI agents that enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module-level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost-efficient mobile agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces MobiBench, a modular multi-branch offline benchmark for mobile GUI agents. It claims to resolve the unfair penalization of valid alternative actions in single-path offline benchmarks and the poor scalability/reproducibility of online live benchmarks by providing multi-path annotations and component-wise evaluation, reporting 94.72% agreement with human evaluators while enabling module-level analysis of techniques, model scales, and design guidelines.

Significance. If the multi-path annotations prove comprehensive and the agreement metric robust, MobiBench would represent a meaningful advance by delivering scalable, reproducible offline evaluation that matches the fidelity of online benchmarks, while also supplying actionable module-level insights that could guide more efficient GUI agent design.

major comments (2)
  1. [§3 and §4.2] §3 (Benchmark Construction) and §4.2 (Human Agreement Evaluation): the 94.72% agreement claim is load-bearing for the central contribution, yet the manuscript provides insufficient detail on the procedure used to enumerate and validate the completeness of alternative paths (e.g., no quantitative coverage metric, no inter-annotator agreement on path exhaustiveness, and no explicit check for omitted error-recovery or navigation-order variants). This leaves open the possibility that agreement rates partly reflect annotation coverage rather than true behavioral equivalence.
  2. [§5.3] §5.3 (Module-Level Analysis): the reported breakdowns by module and model scale are presented without an ablation that isolates the effect of multi-path versus single-path scoring on per-module performance; without this, it is unclear whether the modular insights are driven by the multi-branch feature or would hold under conventional single-path evaluation.
minor comments (3)
  1. [§2] The related-work section (§2) omits several 2024 GUI-agent papers that also explore offline evaluation; adding them would strengthen positioning.
  2. [Figure 2] Figure 2 (benchmark pipeline) would benefit from explicit call-outs for the multi-branch merging step and the exact matching criteria used in path comparison.
  3. [§3.1] A few minor notation inconsistencies appear in the module-interface definitions (e.g., inconsistent use of M_i versus Module_i); a quick pass for uniformity would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of transparency and interpretability that we address below. We have prepared revisions to strengthen the manuscript on both points.

  1. Referee: [§3 and §4.2] §3 (Benchmark Construction) and §4.2 (Human Agreement Evaluation): the 94.72% agreement claim is load-bearing for the central contribution, yet the manuscript provides insufficient detail on the procedure used to enumerate and validate the completeness of alternative paths (e.g., no quantitative coverage metric, no inter-annotator agreement on path exhaustiveness, and no explicit check for omitted error-recovery or navigation-order variants). This leaves open the possibility that agreement rates partly reflect annotation coverage rather than true behavioral equivalence.

    Authors: We agree that greater detail on path enumeration and validation is warranted to substantiate the agreement metric. In the revised manuscript we will expand §3 to describe our multi-annotator protocol, report quantitative coverage statistics (average paths per task and saturation curves), provide inter-annotator agreement figures specifically for path exhaustiveness, and document the systematic inclusion of error-recovery and navigation-order variants. These additions will clarify that the observed agreement reflects comprehensive annotation rather than incomplete coverage. revision: yes

  2. Referee: [§5.3] §5.3 (Module-Level Analysis): the reported breakdowns by module and model scale are presented without an ablation that isolates the effect of multi-path versus single-path scoring on per-module performance; without this, it is unclear whether the modular insights are driven by the multi-branch feature or would hold under conventional single-path evaluation.

    Authors: We concur that an ablation isolating multi-path versus single-path scoring is necessary to interpret the module-level findings. In the revised §5.3 we will add a direct comparison that recomputes all module and scale breakdowns under single-path scoring and contrasts the results with the multi-path evaluation. This will show whether the reported insights depend on the multi-branch annotations. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical agreement measured against independent human judgments

The paper presents MobiBench as an empirical benchmarking framework and reports a 94.72% agreement rate with human evaluators. This rate is obtained by direct comparison to external human annotations rather than any fitted parameters, self-citations, or internal derivations. No equations, predictions, or first-principles claims appear in the provided text that reduce to inputs by construction. The multi-path annotation process is described as an engineering choice whose coverage is validated externally via human agreement, keeping the central result independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that human agreement validates the benchmark and that the constructed dataset covers representative tasks and paths. No free parameters are fitted to data in the reported results.

axioms (1)
  • domain assumption: Human evaluators provide reliable ground truth for valid agent actions.
    The 94.72% agreement metric depends on this assumption being true.
invented entities (1)
  • MobiBench modular multi-path benchmark (no independent evidence)
    purpose: To enable scalable offline evaluation with component analysis
    Newly introduced framework for mobile GUI agent testing.

pith-pipeline@v0.9.0 · 5555 in / 1162 out tokens · 28070 ms · 2026-05-16T22:56:01.658808+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  2. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

  3. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  4. AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.

