ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
Pith reviewed 2026-05-15 19:41 UTC · model grok-4.3
The pith
The ProactiveMobile benchmark shows that current multimodal models lack proactive intelligence on mobile devices but can learn it through fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProactiveMobile formalizes proactive intelligence as the ability to infer latent user intent from on-device contextual signals across four dimensions and to generate executable function sequences drawn from a pool of 63 APIs. The benchmark supplies more than 3,660 instances across 14 scenarios with multi-answer annotations that were audited by 30 experts for factual accuracy, logical consistency, and action feasibility. When a Qwen2.5-VL-7B-Instruct model is fine-tuned on this data, it attains a 19.15% success rate, exceeding the 15.71% of o1 and the 7.39% of GPT-5, demonstrating that proactivity is both missing in current MLLMs and learnable with targeted training.
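To make the formalization concrete, here is a minimal sketch of how a benchmark instance might be represented. Every name in it is hypothetical: the review does not disclose the instance schema, the four signal dimensions, or the actual API names, only that each instance pairs multi-dimensional context with one or more acceptable function sequences drawn from a 63-API pool.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionCall:
    api: str                       # must name one of the 63 pool APIs
    args: dict = field(default_factory=dict)

@dataclass
class ProactiveInstance:
    scenario: str                             # one of the 14 scenarios
    context: dict[str, object]                # signals keyed by dimension
    gold_sequences: list[list[FunctionCall]]  # multi-answer annotations

# Stand-ins for the 63-API function pool (real names are not published here).
API_POOL: set[str] = {"set_alarm", "send_message", "open_navigation"}

def is_executable(seq: list[FunctionCall]) -> bool:
    """Action-feasibility check: every call must reference a pooled API."""
    return all(call.api in API_POOL for call in seq)
```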
What carries the argument
The ProactiveMobile benchmark, which defines proactive tasks as inferring latent user intent from four dimensions of on-device signals and producing executable sequences from a 63-API function pool.
If this is right
- Fine-tuning on proactive examples raises success rates above those of much larger frontier models.
- Objective, executable evaluation of proactivity becomes possible for the first time at mobile scale.
- Proactivity should be treated as a trainable competency rather than an inherent shortfall of MLLMs.
- Future mobile-agent development can use the same benchmark to track and compare gains in autonomous anticipation.
Where Pith is reading between the lines
- Mobile assistants could shift from always waiting for commands to quietly preparing actions based on context, reducing user effort.
- Similar benchmarks built for desktop or web environments might reveal whether proactivity transfers across device types.
- The performance gap suggests that collecting and curating proactive training data is now a high-leverage research direction.
Load-bearing premise
That the 14 chosen scenarios, together with their multi-answer annotations and the audit by 30 expert reviewers, are sufficient to represent real-world mobile complexity and to guarantee factual accuracy, logical consistency, and action feasibility.
What would settle it
Evaluating the same fine-tuned model on a fresh set of real mobile-device interaction logs collected outside the 14 scenarios and finding that its proactive success rate falls back to or below the levels of o1 and GPT-5.
read the original abstract
Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProactiveMobile, a benchmark for proactive intelligence in mobile MLLM agents. It formalizes proactive tasks as inferring latent user intent from four dimensions of on-device contextual signals and generating executable sequences from a pool of 63 APIs. The benchmark contains more than 3,660 instances across 14 scenarios with multi-answer annotations; 30 experts audited entries for factual accuracy, logical consistency, and action feasibility. Experiments report that fine-tuning Qwen2.5-VL-7B-Instruct achieves a 19.15% success rate, outperforming o1 (15.71%) and GPT-5 (7.39%), and conclude that proactivity is learnable.
Significance. If the evaluation protocol is reproducible, the benchmark would address a genuine gap between reactive and proactive mobile agents and supply the first large-scale, executable testbed with expert verification. The finding that fine-tuning improves performance on this task would be a useful existence proof and could stimulate further work on intent anticipation. The scale (more than 3,660 instances) and expert audit are concrete strengths that distinguish the contribution from smaller or unverified datasets.
major comments (3)
- [Experiments / Evaluation Protocol] The success-rate definition used for the headline numbers (19.15% for the fine-tuned model, 15.71% for o1, 7.39% for GPT-5) is not stated. In particular, it is unclear whether success requires an exact API-sequence match, semantic equivalence to any of the multi-answer annotations, partial credit, or execution simulation. Without this definition the reported percentages cannot be reproduced or compared, directly undermining the central claim that proactivity is learnable.
- [Benchmark Construction] The paper states that 30 experts audited the 3,660 instances for factual accuracy, logical consistency, and action feasibility, yet supplies no inter-annotator agreement statistics, decision rules for corrections, or exclusion criteria. This information is load-bearing for the claim that the benchmark “guarantees” objective evaluation.
- [Baselines and Experimental Setup] Implementation details for the o1 and GPT-5 baselines are missing: prompt templates, how the 63-API pool was presented, temperature settings, and any post-processing of generated sequences. These omissions prevent verification that the 15.71% and 7.39% figures were obtained under the same protocol as the fine-tuned model.
minor comments (2)
- [Abstract] The abstract claims the benchmark “embraces real-world complexity” but does not describe how the 14 scenarios were sampled from actual mobile usage logs or validated against external distributions.
- [Task Formalization] Notation for the four dimensions of contextual signals and the function pool should be introduced with a small table or diagram in the main text rather than only in the appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of reproducibility and transparency. We address each major comment below and have revised the manuscript accordingly to strengthen the paper.
read point-by-point responses
- Referee: [Experiments / Evaluation Protocol] The success-rate definition used for the headline numbers (19.15% for the fine-tuned model, 15.71% for o1, 7.39% for GPT-5) is not stated. In particular, it is unclear whether success requires an exact API-sequence match, semantic equivalence to any of the multi-answer annotations, partial credit, or execution simulation. Without this definition the reported percentages cannot be reproduced or compared, directly undermining the central claim that proactivity is learnable.
Authors: We agree that the success-rate definition must be stated explicitly for reproducibility. In the revised manuscript, we have added a new subsection in the Experiments section that defines success as semantic equivalence to any of the multi-answer annotations (determined by matching core actions, parameters, and intent), verified via execution simulation in the mobile environment. Exact sequence match is not required, and we clarify that no partial credit is awarded; full success requires the sequence to achieve the intended outcome. revision: yes
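A minimal sketch of the binary metric this response describes, reusing the hypothetical ProactiveInstance type from the earlier sketch. The exact-match equivalence test is a deliberate simplification: the authors describe semantic matching of core actions, parameters, and intent, verified by execution simulation, neither of which is modeled here.

```python
def sequences_equivalent(pred, gold) -> bool:
    # Simplified stand-in for the paper's semantic-equivalence test.
    return len(pred) == len(gold) and all(
        p.api == g.api and p.args == g.args for p, g in zip(pred, gold)
    )

def success(pred, instance: "ProactiveInstance") -> bool:
    """Binary, no partial credit: matching any one gold sequence suffices."""
    return any(sequences_equivalent(pred, gold) for gold in instance.gold_sequences)

def success_rate(predictions, instances) -> float:
    return sum(success(p, i) for p, i in zip(predictions, instances)) / len(instances)
```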
- Referee: [Benchmark Construction] The paper states that 30 experts audited the 3,660 instances for factual accuracy, logical consistency, and action feasibility, yet supplies no inter-annotator agreement statistics, decision rules for corrections, or exclusion criteria. This information is load-bearing for the claim that the benchmark “guarantees” objective evaluation.
Authors: We acknowledge that inter-annotator agreement statistics were not computed or reported. The audit was conducted iteratively with consensus among the 30 experts rather than independent parallel annotations. In the revision, we expand the Benchmark Construction section to detail the decision rules (majority consensus for corrections), exclusion criteria (instances with unresolved factual or feasibility issues were removed), and the overall verification process, while noting the absence of formal IAA metrics as a limitation. revision: partial
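A toy sketch of the audit rule described above. The vote format and the majority threshold are assumptions; the response specifies only majority consensus for corrections and removal of instances with unresolved issues.

```python
AXES = ("factual_accuracy", "logical_consistency", "action_feasibility")

def audit_decision(votes: dict[str, list[bool]]) -> str:
    """votes maps each audit axis to per-reviewer pass/fail judgments."""
    for axis in AXES:
        axis_votes = votes[axis]
        if sum(axis_votes) * 2 <= len(axis_votes):  # no majority pass
            return "exclude"   # unresolved issue: instance is removed
    return "keep"

# Example: 3 of 30 reviewers flag a feasibility problem -> instance kept.
example = {axis: [True] * 27 + [False] * 3 for axis in AXES}
assert audit_decision(example) == "keep"
```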
- Referee: [Baselines and Experimental Setup] Implementation details for the o1 and GPT-5 baselines are missing: prompt templates, how the 63-API pool was presented, temperature settings, and any post-processing of generated sequences. These omissions prevent verification that the 15.71% and 7.39% figures were obtained under the same protocol as the fine-tuned model.
Authors: We agree that these details are necessary for fair comparison. The revised manuscript includes a new appendix with the full prompt templates for o1 and GPT-5, the presentation of the 63-API pool (as a structured JSON schema in the system prompt), temperature settings (set to 0 for deterministic generation), and post-processing steps (sequence parsing, validation against the API pool, and execution simulation). This ensures all models were evaluated under the identical protocol. revision: yes
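A minimal sketch of the shared protocol this response outlines. The prompt wording and schema layout are assumptions; only the broad strokes (API pool as a JSON schema in the system prompt, temperature 0, parse-then-validate post-processing) come from the response.

```python
import json

def build_system_prompt(api_schemas: list[dict]) -> str:
    # Present the 63-API pool as a structured JSON schema, per the authors.
    return (
        "You are a proactive mobile agent. Infer the user's latent intent "
        "from the context and reply with a JSON list of function calls.\n"
        "Available APIs:\n" + json.dumps(api_schemas, indent=2)
    )

GENERATION_KWARGS = {"temperature": 0}  # deterministic decoding for all models

def postprocess(raw_output: str, api_pool: set[str]):
    """Parse the model output and reject any call outside the API pool."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # unparseable output counts as a failed attempt
    if not isinstance(calls, list) or not all(
        isinstance(c, dict) and c.get("api") in api_pool for c in calls
    ):
        return None  # hallucinated or malformed API call
    return calls     # valid sequence, forwarded to execution simulation
```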
Circularity Check
No circularity: benchmark and empirical measurements are self-contained
full rationale
The paper introduces ProactiveMobile as a new benchmark with more than 3,660 instances across 14 scenarios, multi-answer annotations, and an expert audit by 30 reviewers for factual accuracy, logical consistency, and action feasibility. It reports measured success rates (19.15% for fine-tuned Qwen2.5-VL-7B-Instruct vs. baselines) directly on this benchmark. The paper contains no mathematical derivations, equations, fitted parameters, or self-citations that reduce any claim to its own inputs by construction. The success rates are independent empirical outputs on the proposed dataset rather than predictions or renamings that loop back to the benchmark definition itself. The contribution remains the benchmark construction plus measured performances, with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Proactive intelligence can be formalized as inferring latent user intent across four dimensions of on-device contextual signals and generating executable function sequences from a pool of 63 APIs.