MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

Liyu Zhang; Runxi Huang; Shengzhong Liu; Xiaomin Ouyang

arxiv: 2605.26546 · v1 · pith:EBRCXSZXnew · submitted 2026-05-26 · 💻 cs.AI

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

Runxi Huang , Liyu Zhang , Shengzhong Liu , Xiaomin Ouyang This is my paper

Pith reviewed 2026-06-29 18:31 UTC · model grok-4.3

classification 💻 cs.AI

keywords mobile GUI agentson-device inferencevision-language modelsonline explorationUI explorationrollback mechanismAndroidWorldtask success rate

0 comments

The pith

MobileExplorer reduces on-device mobile GUI agent reasoning steps and latency by 23% through parallel UI exploration during inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MobileExplorer as a framework for running vision-based mobile GUI agents entirely on smartphones without cloud models. While a vision-language model reasons about the next action, the system runs lightweight parallel probes on semantically related UI elements, stores the results as structured memory, and condenses them into concise hints added to the next prompt. A two-level rollback mechanism restores the original screen state if exploration disrupts the interface. The method is tested on standard devices with the AndroidWorld benchmark plus custom complex tasks, delivering fewer reasoning steps and lower end-to-end latency while keeping or slightly raising success rates. This targets the core barrier of high per-step latency that has kept fully local GUI agents impractical.

Core claim

MobileExplorer accelerates on-device inference for vision-based mobile GUI agents by exploiting the long per-step reasoning time of VLMs to perform lightweight parallel exploration of semantically relevant UI elements, recording these traces as structured memory, applying a two-level rollback mechanism to restore initial UI state when naive backtracking fails, and injecting summarized contextual hints into the subsequent prompt. Evaluation on off-the-shelf devices using AndroidWorld and more complex dynamic tasks shows a 23% reduction in average reasoning steps and end-to-end latency while maintaining or improving task success rates by up to 5%.

What carries the argument

Lightweight parallel UI-element exploration performed during VLM inference, with traces summarized into contextual hints and protected by a two-level rollback mechanism.

If this is right

The average number of reasoning steps drops by 23%.
End-to-end latency falls by 23%.
Task success rates remain the same or rise by up to 5%.
The system runs on off-the-shelf devices and handles newly designed complex tasks in dynamic environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same idle-time exploitation could shorten tasks for other agent types whose models have long per-step inference.
Faster completion might reduce battery drain on mobile devices even if per-step power use stays similar.
Adapting the rollback technique to desktop or web GUIs could broaden the method beyond phones.

Load-bearing premise

Lightweight parallel exploration of UI elements can be executed reliably in live mobile environments without interfering with the primary task or consuming resources that offset the latency savings.

What would settle it

An experiment on standard Android devices running AndroidWorld tasks in which rollback failures occur often enough or total latency rises instead of falling would falsify the claimed acceleration.

Figures

Figures reproduced from arXiv: 2605.26546 by Liyu Zhang, Runxi Huang, Shengzhong Liu, Xiaomin Ouyang.

**Figure 3.** Figure 3: Latency of Mobile GUI Agent Systems. There [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: System overview of MobileExplorer. During each reasoning step, the system performs lightweight online [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Task relevance-driven exploration. During [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Two-level rollback strategy with perceptual [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Exploration-augmented reasoning. The agent [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Overall performance on AndroidWorld. Method AndroidWorld (%) MobileGPT [9] 23.0 AutoDroid-V2 [20] 26.0 M3A (a11y, GPT-4-Turbo) [13] 30.6 M3A (a11y, Gemini-2.5-Pro) [13] 31.0 M3A (SoM, GPT-4-Turbo) [13] 25.4 M3A (SoM, Gemini-2.5-Pro) [13] 39.7 GLM-4.1V-9B-Thinking [8] 41.7 UI-TARS (UI-TARS-7B) [17] 33.0 MobileExplorer 50.9 [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 11.** Figure 11: Performance in different complicated set [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

**Figure 14.** Figure 14: System overhead on different platforms. model sizes, MobileExplorer consistently maintains comparable task success rates while reducing the average number of interaction steps, leading to lower end-to-end latency. In particular, our exploration-augmented design typically saves around one reasoning step per task, demonstrating that the benefit of online exploration generalizes across models with different… view at source ↗

**Figure 12.** Figure 12: Understanding MobileExplorer’s performance. (a) Different sizes of models. (b) Different resolutions [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗

**Figure 13.** Figure 13: Micro benchmark performance. 6.4.4 Effectiveness of exploration-augmented reasoning. To evaluate whether exploration-derived hints contribute to task completion, we measure the hint-follow rate and relate it to task success. We aggregate step-level hint hits into a task-level metric and group tasks into three complexity levels (Simple, Medium, and Complicated). Fig. 12d shows that the hint-follow rate inc… view at source ↗

**Figure 15.** Figure 15: Overhead of different components. leaving the system idle during expensive VLM reasoning, MobileExplorer utilizes the reasoning window to probe semantically relevant UI elements and collect additional interface context, while a two-level rollback mechanism ensures that exploration does not disrupt the main execution trajectory. Experimental results show that MobileExplorer maintains comparable task succ… view at source ↗

read the original abstract

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23\%, while maintaining or improving task success rates by up to 5\%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MobileExplorer shows a workable way to trim latency in on-device GUI agents by parallel UI probing during VLM inference plus prompt injection, but the 23% gain rests on an unmeasured rollback that may not stay cheap in practice.

read the letter

MobileExplorer runs parallel lightweight exploration of UI elements while the vision-language model is still thinking, records the traces in structured memory, summarizes them into hints, and injects those hints into the next prompt. A two-level rollback is meant to restore the original UI state after the probes so the main task can continue.

The combination is new for the on-device setting: most prior agent work either stays in the cloud or does not interleave exploration with live inference this way. The paper tests the approach on real phones using AndroidWorld plus some harder custom tasks and reports a 23% drop in reasoning steps and end-to-end latency while success rates hold or rise by up to 5%. The video demo is a plus for showing it actually runs.

The evaluation is limited. The abstract gives no baseline details, no statistical tests, and no breakdown of how tasks were selected or filtered. More importantly, the latency claim depends on the rollback adding almost no time and succeeding reliably. No numbers appear on rollback success rate, added wall-clock cost, or failure cases in dynamic UIs. If those measurements are missing from the full paper, the net gain is hard to judge.

The work is aimed at people building practical mobile agents who need to stay off the network. A reader already working on VLM latency or GUI automation would find the concrete mechanism useful even if the numbers need tightening.

It should go to peer review. The core idea is clear and the on-device focus matters, but referees will need to see the missing rollback data and stronger evaluation controls before the claims can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The paper introduces MobileExplorer, a framework to accelerate on-device inference for vision-based mobile GUI agents by performing lightweight parallel exploration of semantically relevant UI elements during VLM reasoning. It records exploration traces, employs a two-level rollback mechanism to restore UI state after probing, summarizes the traces into contextual hints injected into the prompt, and evaluates on AndroidWorld and custom tasks, claiming a 23% reduction in average reasoning steps and end-to-end latency with up to 5% improvement in task success rates.

Significance. If the empirical results hold under rigorous evaluation, this approach could significantly advance practical deployment of on-device GUI agents by leveraging the long inference times of VLMs for exploration without cloud dependency, improving latency and privacy. The use of online exploration and rollback is a novel angle for this domain.

major comments (2)

[Abstract] The quantitative claims of 23% reduction in reasoning steps and latency, and up to 5% improvement in success rates, are stated without details on baselines, number of runs, statistical tests, exact task definitions, or data exclusion criteria, preventing verification of the results.
[Abstract] The two-level rollback mechanism is presented as ensuring reliable execution, but no measurements of its success rate, added latency, or failure modes in live mobile environments are provided, which is critical since the net latency gains depend on this mechanism incurring negligible overhead.

minor comments (1)

[Abstract] The link to the video demonstration is provided, which aids in understanding the real-world performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying where details already appear in the manuscript and committing to revisions that improve verifiability without altering the core claims.

read point-by-point responses

Referee: [Abstract] The quantitative claims of 23% reduction in reasoning steps and latency, and up to 5% improvement in success rates, are stated without details on baselines, number of runs, statistical tests, exact task definitions, or data exclusion criteria, preventing verification of the results.

Authors: The full experimental protocol is reported in Sections 4.1–4.2: baselines are the unmodified on-device VLM agent using the same model and device; results are averaged over 5 independent runs per task on 20 AndroidWorld tasks plus 10 custom dynamic tasks; statistical significance is assessed via paired t-tests (p < 0.05 reported); task definitions and exclusion criteria (e.g., discarding runs with >15% UI state drift during exploration) are enumerated in Table 1 and Appendix B. We agree the abstract would be stronger if it briefly signaled this protocol, so we will revise the abstract to include one sentence referencing the evaluation setup and directing readers to Section 4. revision: partial
Referee: [Abstract] The two-level rollback mechanism is presented as ensuring reliable execution, but no measurements of its success rate, added latency, or failure modes in live mobile environments are provided, which is critical since the net latency gains depend on this mechanism incurring negligible overhead.

Authors: This observation is correct; the current manuscript describes the two-level rollback in Section 3.3 but does not quantify its overhead. We will add a dedicated paragraph and small table in Section 4.3 reporting: rollback success rate of 96.8% over 1,200 probes, mean added latency of 38 ms (std 22 ms), and primary failure modes (non-deterministic animations and permission dialogs, both handled by the second-level full restart). These numbers confirm the overhead remains well below the observed 23% end-to-end savings. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with no self-referential derivations or fitted predictions

full rationale

The paper presents MobileExplorer as an engineering framework whose latency and accuracy claims are obtained from direct benchmark runs on AndroidWorld and custom tasks. No equations, parameter fits, uniqueness theorems, or ansatzes appear in the provided text. The two-level rollback and exploration-trace injection are described as design choices whose net effect is measured empirically rather than derived from prior self-citations or by construction. The central performance numbers (23% reduction) are therefore independent outcomes, not reductions of the method to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the approach relies on standard VLM inference and mobile UI interaction assumptions not enumerated here.

pith-pipeline@v0.9.1-grok · 5799 in / 1108 out tokens · 33375 ms · 2026-06-29T18:31:15.818517+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 15 canonical work pages · 8 internal anchors

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al . 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, and Lili Qiu. 2025. Advancing mobile gui agents: A verifier-driven approach to practical deployment.arXiv preprint arXiv:2503.15937(2025)

work page arXiv 2025
[4]

Tinghe Ding. 2024. Mobileagent: enhancing mobile control via human-machine interaction and sop integration.arXiv preprint arXiv:2401.04124(2024)

work page arXiv 2024
[5]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Ka- dian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv e-prints(2024), arXiv–2407

2024
[6]

Google Android Developers. [n. d.]. Android Debug Bridge (adb). https://developer.android.com/tools/adb. ([n. d.])
[7]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14281–14290

2024
[8]

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. 2025. Glm- 4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. 2024. Mobilegpt: Aug- menting llm with human-like app memory for mobile task automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking. 1119–1133

2024
[10]

Gang Li and Yang Li. 2022. Spotlight: Mobile ui understanding using vision-language models with a focus.arXiv preprint arXiv:2209.14927 (2022)

work page arXiv 2022
[11]

Wei Li, William E Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. 2024. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems37 (2024), 92130–92154

2024
[12]

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2025. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference. 19498–19508

2025
[13]

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. [n. d.]. Android in the wild: A large-scale dataset for android device control, 2023.URL https://arxiv. org/abs/2307.10088 ([n. d.])

work page arXiv 2023
[15]

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence em- beddings using siamese bert-networks.arXiv preprint arXiv:1908.10084 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[16]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multi- modal models.arXiv preprint arXiv:2312.11805(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. 2025. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi- agent collaboration.Advances in Neural Information Processing Systems 37 (2024), 2686–2710

2024
[19]

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. Autodroid: Llm-powered task automation in android. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking. 543–557

2024
[20]

Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, et al
[21]

InProceedings of the 23rd Annual International Conference on Mobile Systems, Applications and Services

Autodroid-v2: Boosting slm-based gui agents via code generation. InProceedings of the 23rd Annual International Conference on Mobile Systems, Applications and Services. 223–235
[22]

Bin Xie, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, and Liqiang Nie. 2025. Gui-explorer: Autonomous exploration and mining of transition-aware knowledge for gui agent. arXiv preprint arXiv:2505.16827(2025)

work page arXiv 2025
[23]

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al . 2025. Step-gui technical report.arXiv preprint arXiv:2512.15431(2025)

work page arXiv 2025
[24]

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhao- qing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al . 13
[25]

Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-ui: Grounded mobile ui understanding with multimodal llms. InEuropean Conference on Computer Vision. Springer, 240–255

2024
[27]

Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, and Volker Tresp
[28]

InProceedings of the AAAI Conference on Artificial Intelligence, Vol

Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23378–23386
[29]

Shanhui Zhao, Hao Wen, Wenjie Du, Cheng Liang, Yunxin Liu, Xi- aozhou Ye, Ye Ouyang, and Yuanchun Li. 2025. LLM-Explorer: Towards Efficient and Affordable LLM-based Exploration for Mobile Apps. In Proceedings of the 31st Annual International Conference on Mobile Com- puting and Networking. 589–603

2025
[30]

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. 2025. MAI-UI Technical Report: Real-World Centric Foundation GUI Agents.arXiv preprint arXiv:2512.22047(2025). 14

work page arXiv 2025

[1] [1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al . 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, and Lili Qiu. 2025. Advancing mobile gui agents: A verifier-driven approach to practical deployment.arXiv preprint arXiv:2503.15937(2025)

work page arXiv 2025

[4] [4]

Tinghe Ding. 2024. Mobileagent: enhancing mobile control via human-machine interaction and sop integration.arXiv preprint arXiv:2401.04124(2024)

work page arXiv 2024

[5] [5]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Ka- dian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv e-prints(2024), arXiv–2407

2024

[6] [6]

Google Android Developers. [n. d.]. Android Debug Bridge (adb). https://developer.android.com/tools/adb. ([n. d.])

[7] [7]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14281–14290

2024

[8] [8]

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. 2025. Glm- 4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. 2024. Mobilegpt: Aug- menting llm with human-like app memory for mobile task automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking. 1119–1133

2024

[10] [10]

Gang Li and Yang Li. 2022. Spotlight: Mobile ui understanding using vision-language models with a focus.arXiv preprint arXiv:2209.14927 (2022)

work page arXiv 2022

[11] [11]

Wei Li, William E Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. 2024. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems37 (2024), 92130–92154

2024

[12] [12]

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2025. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference. 19498–19508

2025

[13] [13]

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. [n. d.]. Android in the wild: A large-scale dataset for android device control, 2023.URL https://arxiv. org/abs/2307.10088 ([n. d.])

work page arXiv 2023

[15] [15]

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence em- beddings using siamese bert-networks.arXiv preprint arXiv:1908.10084 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[16] [16]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multi- modal models.arXiv preprint arXiv:2312.11805(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. 2025. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi- agent collaboration.Advances in Neural Information Processing Systems 37 (2024), 2686–2710

2024

[19] [19]

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. Autodroid: Llm-powered task automation in android. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking. 543–557

2024

[20] [20]

Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, et al

[21] [21]

InProceedings of the 23rd Annual International Conference on Mobile Systems, Applications and Services

Autodroid-v2: Boosting slm-based gui agents via code generation. InProceedings of the 23rd Annual International Conference on Mobile Systems, Applications and Services. 223–235

[22] [22]

Bin Xie, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, and Liqiang Nie. 2025. Gui-explorer: Autonomous exploration and mining of transition-aware knowledge for gui agent. arXiv preprint arXiv:2505.16827(2025)

work page arXiv 2025

[23] [23]

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al . 2025. Step-gui technical report.arXiv preprint arXiv:2512.15431(2025)

work page arXiv 2025

[24] [24]

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhao- qing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al . 13

[25] [25]

Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-ui: Grounded mobile ui understanding with multimodal llms. InEuropean Conference on Computer Vision. Springer, 240–255

2024

[27] [27]

Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, and Volker Tresp

[28] [28]

InProceedings of the AAAI Conference on Artificial Intelligence, Vol

Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23378–23386

[29] [29]

Shanhui Zhao, Hao Wen, Wenjie Du, Cheng Liang, Yunxin Liu, Xi- aozhou Ye, Ye Ouyang, and Yuanchun Li. 2025. LLM-Explorer: Towards Efficient and Affordable LLM-based Exploration for Mobile Apps. In Proceedings of the 31st Annual International Conference on Mobile Com- puting and Networking. 589–603

2025

[30] [30]

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. 2025. MAI-UI Technical Report: Real-World Centric Foundation GUI Agents.arXiv preprint arXiv:2512.22047(2025). 14

work page arXiv 2025