ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
Pith reviewed 2026-05-15 19:41 UTC · model grok-4.3
The pith
The ProactiveMobile benchmark shows that current multimodal models lack proactive intelligence on mobile devices but can learn it through fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProactiveMobile formalizes proactive intelligence as the ability to infer latent user intent from on-device contextual signals across four dimensions and to generate executable function sequences drawn from a pool of 63 APIs. The benchmark supplies more than 3,660 instances across 14 scenarios with multi-answer annotations that were audited by 30 experts for factual accuracy, logical consistency, and action feasibility. When a Qwen2.5-VL-7B-Instruct model is fine-tuned on this data, it attains a 19.15% success rate, exceeding the 15.71% of o1 and the 7.39% of GPT-5, demonstrating that proactivity is both missing in current MLLMs and learnable with targeted training.
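To make the formalization concrete, here is a minimal sketch of how a benchmark instance might be represented. Every name in it is hypothetical: the review does not disclose the instance schema, the four signal dimensions, or the actual API names, only that each instance pairs multi-dimensional context with one or more acceptable function sequences drawn from a 63-API pool.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionCall:
    api: str                       # must name one of the 63 pool APIs
    args: dict = field(default_factory=dict)

@dataclass
class ProactiveInstance:
    scenario: str                             # one of the 14 scenarios
    context: dict[str, object]                # signals keyed by dimension
    gold_sequences: list[list[FunctionCall]]  # multi-answer annotations

# Stand-ins for the 63-API function pool (real names are not published here).
API_POOL: set[str] = {"set_alarm", "send_message", "open_navigation"}

def is_executable(seq: list[FunctionCall]) -> bool:
    """Action-feasibility check: every call must reference a pooled API."""
    return all(call.api in API_POOL for call in seq)
```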
What carries the argument
The ProactiveMobile benchmark, which defines proactive tasks as inferring latent user intent from four dimensions of on-device signals and producing executable sequences from a 63-API function pool.
If this is right
- Fine-tuning on proactive examples raises success rates above those of much larger frontier models.
- Objective, executable evaluation of proactivity becomes possible for the first time at mobile scale.
- Proactivity should be treated as a trainable competency rather than an inherent shortfall of MLLMs.
- Future mobile-agent development can use the same benchmark to track and compare gains in autonomous anticipation.
Where Pith is reading between the lines
- Mobile assistants could shift from always waiting for commands to quietly preparing actions based on context, reducing user effort.
- Similar benchmarks built for desktop or web environments might reveal whether proactivity transfers across device types.
- The performance gap suggests that collecting and curating proactive training data is now a high-leverage research direction.
Load-bearing premise
That the 14 chosen scenarios, together with their multi-answer annotations and the audit by 30 expert reviewers, are sufficient to represent real-world mobile complexity and to guarantee factual accuracy, logical consistency, and action feasibility.
What would settle it
Evaluating the same fine-tuned model on a fresh set of real mobile-device interaction logs collected outside the 14 scenarios and finding that its proactive success rate falls back to or below the levels of o1 and GPT-5.
read the original abstract
Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProactiveMobile, a benchmark for proactive intelligence in mobile MLLM agents. It formalizes proactive tasks as inferring latent user intent from four dimensions of on-device contextual signals and generating executable sequences from a pool of 63 APIs. The benchmark contains more than 3,660 instances across 14 scenarios with multi-answer annotations; 30 experts audited entries for factual accuracy, logical consistency, and action feasibility. Experiments report that fine-tuning Qwen2.5-VL-7B-Instruct achieves a 19.15% success rate, outperforming o1 (15.71%) and GPT-5 (7.39%), and conclude that proactivity is learnable.
Significance. If the evaluation protocol is reproducible, the benchmark would address a genuine gap between reactive and proactive mobile agents and supply the first large-scale, executable testbed with expert verification. The finding that fine-tuning improves performance on this task would be a useful existence proof and could stimulate further work on intent anticipation. The scale (more than 3,660 instances) and expert audit are concrete strengths that distinguish the contribution from smaller or unverified datasets.
major comments (3)
- [Experiments / Evaluation Protocol] The success-rate definition used for the headline numbers (19.15% for the fine-tuned model, 15.71% for o1, 7.39% for GPT-5) is not stated. In particular, it is unclear whether success requires an exact API-sequence match, semantic equivalence to any of the multi-answer annotations, partial credit, or execution simulation. Without this definition the reported percentages cannot be reproduced or compared, directly undermining the central claim that proactivity is learnable.
- [Benchmark Construction] The paper states that 30 experts audited the 3,660 instances for factual accuracy, logical consistency, and action feasibility, yet supplies no inter-annotator agreement statistics, decision rules for corrections, or exclusion criteria. This information is load-bearing for the claim that the benchmark “guarantees” objective evaluation.
- [Baselines and Experimental Setup] Implementation details for the o1 and GPT-5 baselines are missing: prompt templates, how the 63-API pool was presented, temperature settings, and any post-processing of generated sequences. These omissions prevent verification that the 15.71% and 7.39% figures were obtained under the same protocol as the fine-tuned model.
minor comments (2)
- [Abstract] The abstract claims the benchmark “embraces real-world complexity” but does not describe how the 14 scenarios were sampled from actual mobile usage logs or validated against external distributions.
- [Task Formalization] Notation for the four dimensions of contextual signals and the function pool should be introduced with a small table or diagram in the main text rather than only in the appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of reproducibility and transparency. We address each major comment below and have revised the manuscript accordingly to strengthen the paper.
read point-by-point responses
- Referee: [Experiments / Evaluation Protocol] The success-rate definition used for the headline numbers (19.15% for the fine-tuned model, 15.71% for o1, 7.39% for GPT-5) is not stated. In particular, it is unclear whether success requires an exact API-sequence match, semantic equivalence to any of the multi-answer annotations, partial credit, or execution simulation. Without this definition the reported percentages cannot be reproduced or compared, directly undermining the central claim that proactivity is learnable.
Authors: We agree that the success-rate definition must be stated explicitly for reproducibility. In the revised manuscript, we have added a new subsection in the Experiments section that defines success as semantic equivalence to any of the multi-answer annotations (determined by matching core actions, parameters, and intent), verified via execution simulation in the mobile environment. Exact sequence match is not required, and we clarify that no partial credit is awarded; full success requires the sequence to achieve the intended outcome. revision: yes
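A minimal sketch of the binary metric this response describes, reusing the hypothetical ProactiveInstance type from the earlier sketch. The exact-match equivalence test is a deliberate simplification: the authors describe semantic matching of core actions, parameters, and intent, verified by execution simulation, neither of which is modeled here.

```python
def sequences_equivalent(pred, gold) -> bool:
    # Simplified stand-in for the paper's semantic-equivalence test.
    return len(pred) == len(gold) and all(
        p.api == g.api and p.args == g.args for p, g in zip(pred, gold)
    )

def success(pred, instance: "ProactiveInstance") -> bool:
    """Binary, no partial credit: matching any one gold sequence suffices."""
    return any(sequences_equivalent(pred, gold) for gold in instance.gold_sequences)

def success_rate(predictions, instances) -> float:
    return sum(success(p, i) for p, i in zip(predictions, instances)) / len(instances)
```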
- Referee: [Benchmark Construction] The paper states that 30 experts audited the 3,660 instances for factual accuracy, logical consistency, and action feasibility, yet supplies no inter-annotator agreement statistics, decision rules for corrections, or exclusion criteria. This information is load-bearing for the claim that the benchmark “guarantees” objective evaluation.
Authors: We acknowledge that inter-annotator agreement statistics were not computed or reported. The audit was conducted iteratively with consensus among the 30 experts rather than independent parallel annotations. In the revision, we expand the Benchmark Construction section to detail the decision rules (majority consensus for corrections), exclusion criteria (instances with unresolved factual or feasibility issues were removed), and the overall verification process, while noting the absence of formal IAA metrics as a limitation. revision: partial
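A toy sketch of the audit rule described above. The vote format and the majority threshold are assumptions; the response specifies only majority consensus for corrections and removal of instances with unresolved issues.

```python
AXES = ("factual_accuracy", "logical_consistency", "action_feasibility")

def audit_decision(votes: dict[str, list[bool]]) -> str:
    """votes maps each audit axis to per-reviewer pass/fail judgments."""
    for axis in AXES:
        axis_votes = votes[axis]
        if sum(axis_votes) * 2 <= len(axis_votes):  # no majority pass
            return "exclude"   # unresolved issue: instance is removed
    return "keep"

# Example: 3 of 30 reviewers flag a feasibility problem -> instance kept.
example = {axis: [True] * 27 + [False] * 3 for axis in AXES}
assert audit_decision(example) == "keep"
```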
- Referee: [Baselines and Experimental Setup] Implementation details for the o1 and GPT-5 baselines are missing: prompt templates, how the 63-API pool was presented, temperature settings, and any post-processing of generated sequences. These omissions prevent verification that the 15.71% and 7.39% figures were obtained under the same protocol as the fine-tuned model.
Authors: We agree that these details are necessary for fair comparison. The revised manuscript includes a new appendix with the full prompt templates for o1 and GPT-5, the presentation of the 63-API pool (as a structured JSON schema in the system prompt), temperature settings (set to 0 for deterministic generation), and post-processing steps (sequence parsing, validation against the API pool, and execution simulation). This ensures all models were evaluated under the identical protocol. revision: yes
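A minimal sketch of the shared protocol this response outlines. The prompt wording and schema layout are assumptions; only the broad strokes (API pool as a JSON schema in the system prompt, temperature 0, parse-then-validate post-processing) come from the response.

```python
import json

def build_system_prompt(api_schemas: list[dict]) -> str:
    # Present the 63-API pool as a structured JSON schema, per the authors.
    return (
        "You are a proactive mobile agent. Infer the user's latent intent "
        "from the context and reply with a JSON list of function calls.\n"
        "Available APIs:\n" + json.dumps(api_schemas, indent=2)
    )

GENERATION_KWARGS = {"temperature": 0}  # deterministic decoding for all models

def postprocess(raw_output: str, api_pool: set[str]):
    """Parse the model output and reject any call outside the API pool."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # unparseable output counts as a failed attempt
    if not isinstance(calls, list) or not all(
        isinstance(c, dict) and c.get("api") in api_pool for c in calls
    ):
        return None  # hallucinated or malformed API call
    return calls     # valid sequence, forwarded to execution simulation
```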
Circularity Check
No circularity: benchmark and empirical measurements are self-contained
full rationale
The paper introduces ProactiveMobile as a new benchmark with more than 3,660 instances across 14 scenarios, multi-answer annotations, and an expert audit by 30 reviewers for factual accuracy, logical consistency, and action feasibility. It reports measured success rates (19.15% for fine-tuned Qwen2.5-VL-7B-Instruct vs. baselines) directly on this benchmark. The paper contains no mathematical derivations, equations, fitted parameters, or self-citations that reduce any claim to its own inputs by construction. The success rates are independent empirical outputs on the proposed dataset rather than predictions or renamings that loop back to the benchmark definition itself. The contribution remains the benchmark construction plus measured performances, with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Proactive intelligence can be formalized as inferring latent user intent across four dimensions of on-device contextual signals and generating executable function sequences from a pool of 63 APIs.