Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
Pith reviewed 2026-05-10 04:39 UTC · model grok-4.3
The pith
Text-only inputs work nearly as well as screenshots for LLM smartphone agents, but UI accessibility flaws cause most failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-driven smartphone automation achieves comparable success rates with screentext inputs alone versus multimodal inputs that add screenshots, while the dominant problems stem from insufficient UI accessibility, limitations in input modalities, and mismatches between LLM expectations and app designs.
What carries the argument
The DailyDroid benchmark of 75 tasks spanning five scenarios and three difficulty levels, used to run controlled comparisons of text-only versus text-plus-screenshot performance across 300 trials.
Load-bearing premise
That the 75 tasks in DailyDroid sufficiently represent real-world LLM-driven smartphone automation challenges, and that the observed failures generalize beyond the tested models and apps.
What would settle it
A larger study with more diverse real-user tasks or additional LLMs that finds substantially higher success rates from multimodal inputs than from text alone would disprove the comparability result.
Original abstract
With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DailyDroid, a benchmark of 75 tasks across five scenarios in 25 Android apps at three difficulty levels, to study LLM-driven smartphone automation. It evaluates text-only versus multimodal (text + screenshot) inputs on GPT-4o and o4-mini over 300 trials, reporting comparable performance with multimodal inputs showing marginally higher success rates. The work also presents an in-depth failure analysis that compiles a handbook of common failures, highlighting issues in UI accessibility, input modalities, and LLM/app design.
Significance. If the benchmark tasks prove representative, the work offers a practical empirical foundation for understanding failure modes in mobile LLM agents and supplies a reusable failure taxonomy that could directly inform improvements in agent prompting, app UI design, and accessibility features. The focus on everyday scenarios and the explicit comparison of input modalities are timely given the rapid deployment of such agents.
major comments (3)
- [§3] Benchmark Construction: The 75 tasks, five scenarios, and three difficulty levels are presented as mimicking everyday smartphone use, yet the manuscript provides no quantitative sampling justification, coverage metrics (e.g., fraction of permission dialogs, background services, or cross-app handoffs), or comparison against app-store usage statistics. This directly affects the load-bearing claim that the observed performance parity and the compiled failure handbook generalize beyond the tested set.
- [Abstract, §5] Evaluation: The central result that text-only and multimodal inputs yield 'comparable performance' with multimodal 'marginally higher success rates' is stated without numerical success rates, per-scenario or per-model breakdowns, error bars, or statistical significance tests. Without these quantities, the magnitude and reliability of the reported parity cannot be assessed.
- [§6] Failure Analysis: The handbook of common failures is derived from the 300 trials, but the manuscript does not describe how the taxonomy was constructed, the distribution of failures across categories, or any inter-annotator agreement procedure. This limits the precision and reproducibility of the identified 'critical issues' in UI accessibility and input modalities.
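To make the second comment concrete: if the 300 trials split evenly, each input condition has roughly 150 trials, and simple binomial interval arithmetic shows how wide the uncertainty on a success rate can be at that sample size. A minimal sketch using the Wilson score interval (the success counts below are hypothetical, chosen only for illustration; they are not the paper's results):

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Hypothetical counts for illustration only (not the paper's numbers):
# 150 trials per input condition, text-only 98 successes, multimodal 104.
for label, k in [("text-only", 98), ("multimodal", 104)]:
    lo, hi = wilson_interval(k, 150)
    print(f"{label}: {k / 150:.2f} [{lo:.2f}, {hi:.2f}]")
```

With counts of this order the two intervals overlap heavily, which is consistent with 'comparable performance' but does not establish it; a paired test on per-task outcomes would still be needed.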
minor comments (2)
- [§4] The prompt templates and exact screenshot encoding used for multimodal inputs are not shown; including them (perhaps in an appendix) would improve reproducibility.
- [Throughout] A small number of typographical inconsistencies appear in the scenario descriptions; a final proofreading pass is recommended.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have made revisions to improve clarity, rigor, and reproducibility.
Point-by-point responses
- Referee: [§3] Benchmark Construction: The 75 tasks, five scenarios, and three difficulty levels are presented as mimicking everyday smartphone use, yet the manuscript provides no quantitative sampling justification, coverage metrics (e.g., fraction of permission dialogs, background services, or cross-app handoffs), or comparison against app-store usage statistics. This directly affects the load-bearing claim that the observed performance parity and the compiled failure handbook generalize beyond the tested set.
  Authors: We acknowledge the absence of explicit quantitative sampling justification and coverage metrics in the original manuscript. Task selection was guided by common everyday smartphone interactions drawn from prior HCI literature and pilot testing, but we agree this should be documented more rigorously. In the revised manuscript, we will add a dedicated subsection in §3 describing the curation process, including the rationale for the five scenarios and three difficulty levels, along with available coverage metrics from our task set (e.g., presence of permission flows and cross-app elements). We will also explicitly discuss limitations regarding generalizability to broader app-store distributions, as we did not perform a full statistical comparison against usage logs. Revision: yes.
- Referee: [Abstract, §5] Evaluation: The central result that text-only and multimodal inputs yield 'comparable performance' with multimodal 'marginally higher success rates' is stated without numerical success rates, per-scenario or per-model breakdowns, error bars, or statistical significance tests. Without these quantities, the magnitude and reliability of the reported parity cannot be assessed.
  Authors: We appreciate this observation. While detailed results including per-model and per-scenario success rates appear in tables and figures within §5, the abstract and high-level summary in §5 did not include the specific numerical values or statistical details. In the revision, we will update the abstract to report key aggregate success rates (text-only vs. multimodal) and add explicit per-scenario breakdowns, error bars, and statistical significance tests (e.g., McNemar's test or similar) to §5 to substantiate the claims of comparability and marginal improvement. Revision: yes.
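For concreteness, the exact (binomial) form of McNemar's test the authors mention operates only on the discordant task pairs, i.e., trials where exactly one input condition succeeded. A minimal sketch with hypothetical discordant counts (illustrative only, not the paper's data):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant pair counts:
    b = trials where text-only succeeded but multimodal failed,
    c = trials where multimodal succeeded but text-only failed."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial test with p = 0.5 under the null of no difference.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, p)

# Hypothetical discordant counts for illustration (not the paper's data):
print(mcnemar_exact(5, 11))  # ≈ 0.21: no evidence of a difference at this size
```

Because the concordant pairs drop out, a marginal multimodal advantage on 75 tasks can easily fail to reach significance, which is exactly why the referee asked for the test to be reported.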
- Referee: [§6] Failure Analysis: The handbook of common failures is derived from the 300 trials, but the manuscript does not describe how the taxonomy was constructed, the distribution of failures across categories, or any inter-annotator agreement procedure. This limits the precision and reproducibility of the identified 'critical issues' in UI accessibility and input modalities.
  Authors: We agree that the taxonomy construction process requires more detail. The categories were derived through iterative qualitative analysis of all 300 trial logs and failure cases by the author team, focusing on observable patterns in UI accessibility, reasoning, and modality issues. In the revised §6, we will include: (1) a step-by-step description of the taxonomy development, (2) the distribution of failures across categories with counts or percentages, and (3) clarification of the annotation procedure. As the analysis was performed internally by the core team without multiple independent annotators, we will note the absence of formal inter-annotator agreement metrics as a limitation while emphasizing the systematic categorization approach used. Revision: partial.
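Should the authors later add a second independent annotator, chance-corrected agreement on failure categories is straightforward to compute. A minimal sketch of Cohen's kappa over hypothetical failure labels (the category names and label sequences here are illustrative, not taken from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' category labels on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators over six failure cases:
a = ["ui", "ui", "modality", "reasoning", "ui", "modality"]
b = ["ui", "reasoning", "modality", "reasoning", "ui", "modality"]
print(round(cohens_kappa(a, b), 3))  # → 0.75
```

Reporting kappa alongside the category counts would address both the reproducibility and the distribution concerns in one table.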
Circularity Check
No significant circularity in this empirical benchmark study
Full rationale
The paper introduces the DailyDroid benchmark of 75 tasks, runs direct trials on GPT-4o and o4-mini using text-only and multimodal inputs, reports success rates, and compiles a failure handbook from observed outcomes. No equations, derivations, fitted parameters, or predictions appear; results derive from external model executions on the defined tasks rather than any reduction to the paper's own inputs or self-citations. The central claims rest on empirical measurement, not on any self-referential chain that collapses by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 75 tasks across 25 apps and three difficulty levels adequately mimic everyday smartphone automation challenges.
Reference graph
Works this paper leans on
- [1] Muna Al-Razgan, Sarah Almoaiqel, Nuha Alrajhi, Alyah Alhumegani, Abeer Alshehri, Bashayr Alnefaie, Raghad AlKhamiss, and Shahad Rushdi. 2021. A systematic literature review on the usability of mobile applications for visually impaired users. PeerJ Computer Science 7 (2021), e771.
- [3] Amazon. 2014. Alexa. https://www.alexa.com Accessed on April 17, 2025.
- [4] Android Developers. 2025. AccessibilityService. https://developer.android.com/reference/android/accessibilityservice/AccessibilityService. Accessed on April 13, 2025.
- [5] Anthropic. 2024. Introducing computer use. Anthropic News (22 Oct 2024). https://www.anthropic.com/news/3-5-models-and-computer-use Accessed on April 17, 2025.
- [6] Anthropic. 2025. Claude: An AI Language Model. https://claude.ai/chats. Accessed: 2025-06-04.
- [7] Apple. 2011. Siri. https://www.apple.com/siri/ Accessed on April 17, 2025.
- [8] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. ScreenAI: A Vision-Language Model for UI and Infographics Understanding. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea.
- [9] Paul Bamberg, Yen-lu Chow, Laurence Gillick, Robert Roth, and Dean Sturtevant. 1990. The Dragon continuous speech recognition system: a real-time implementation. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
- [10] Butterfly Effect. 2025. Manus. https://manus.im/. Accessed: 2025-04-17.
- [11] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya G. Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. Why Do Multi-Agent LLM Systems Fail? CoRR abs/2503.13657 (2025). doi:10.48550/ARXIV.2503.13657.
- [13] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024.
- [14] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36 (2023), 28091–28114.
- [15] Barney Glaser and Anselm Strauss. 2017. Discovery of Grounded Theory: Strategies for Qualitative Research. Routledge.
- [16] Google. 2016. Google Assistant for Android. https://developer.android.com/guide/app-actions/overview Accessed on April 17, 2025.
- [17] Google. 2024. Material Design for Android. https://developer.android.com/develop/ui/views/theming/look-and-feel Accessed: 2025-07-14.
- [18] Google. 2025. Fragments. Android Developers. https://developer.android.com/guide/fragments
- [19] Izzeddin Gur, Hiroki Furuta, Austin V. Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- [20] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- [21] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14281–14290.
- [22] Simo Hosio, Denzil Ferreira, Jorge Gonçalves, Niels van Berkel, Chu Luo, Muzamil Ahmed, Huber Flores, and Vassilis Kostakos. 2016. Monetary Assessment of Battery Life on Smartphones. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ACM, 1869–1880. doi:10.1145/2858036.2858285.
- [23] Mohit Jain, Nirmalendu Diwakar, and Manohar Swaminathan. 2021. Smartphone Usage by Expert Blind Users. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21), ACM, Article 34, 15 pages. doi:10.1145/3411764.3445074.
- [25] Brennan Jones, Yan Xu, Qisheng Li, and Stefan Scherer. 2024. Designing a Proactive Context-Aware AI Chatbot for People's Long-Term Goals. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA 2024, ACM, 104:1–104:7. doi:10.1145/3613905.3650912.
- [26] Noam Kahlon, Guy Rom, Anatoly Efros, Filippo Galgani, Omri Berkovitch, Sapir Caduri, William E Bishop, Oriana Riva, and Ido Dagan. 2025. Agent-Initiated Interaction in Phone UI Automation. In Companion Proceedings of the ACM on Web Conference 2025, 2391–2400.
- [27] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. Advances in Neural Information Processing Systems 36 (2023), 39648–39677.
- [29] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020. Mapping Natural Language Instructions to Mobile UI Action Sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, 8198–8210. doi:10.18653/V1/2020.ACL-MAIN.729.
- [32] Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2025. Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore. OpenReview.net.
- [33] Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, et al. 2025. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990 (2025).
- [34] Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, and Hongsheng Li. 2025. LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects. CoRR abs/2504.19838 (2025).
- [35] Zechun Liu, Changsheng Zhao, Forrest N. Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, and Vikas Chandra. 2024. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria.
- [36] William Merrill, Jackson Petty, and Ashish Sabharwal. 2024. The Illusion of State in State-Space Models. In Forty-first International Conference on Machine Learning, ICML 2024. https://openreview.net/forum?id=QZgo9JZpLq
- [37] Microsoft. 2023. Microsoft Copilot. https://copilot.microsoft.com/ Accessed: July 9, 2025.
- [38] Meredith Ringel Morris, Annuska Zolyomi, Catherine Yao, Sina Bahram, Jeffrey P Bigham, and Shaun K Kane. 2016. "With most of it being pictures now, I rarely use it": Understanding Twitter's Evolving Accessibility to Blind Users. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 5506–5516.
- [39] Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, and Qi Wang. 2024. ScreenAgent: A Vision Language Model-driven Computer Control Agent. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, 6433–6441.
- [40] OpenAI. 2022. ChatGPT. https://chatgpt.com/ Accessed: July 9, 2025.
- [41] OpenAI. 2025. Computer-Using Agent. https://openai.com/index/computer-using-agent/. Accessed: 2025-03-26.
- [42] OpenAI. 2025. OpenAI Model pricing. https://openai.com/api/pricing/. Accessed: 2025-06-29.
- [45] Sanket Pandya. 2024. Android Material Design Guidelines. Design Bootcamp on Medium (20 Jun 2024). https://medium.com/design-bootcamp/android-material-design-guidelines-4cd9b3a3b454 Accessed: 2025-07-14.
- [46] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22.
- [47] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573 (2024).
- [48] Byron Reeves, Thomas Robinson, and Nilam Ram. 2020. Time for the human screenome project. Nature 577, 7790 (2020), 314–317.
- [49] André Rodrigues, Kyle Montague, Hugo Nicolau, João Guerreiro, and Tiago Guerreiro. 2017. In-context Q&A to support blind people using smartphones. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, 32–36.
- [50] Anne Spencer Ross, Xiaoyi Zhang, James Fogarty, and Jacob O Wobbrock. 2018. Examining image-based button labeling for accessibility in Android apps through large-scale analysis. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, 119–130.
- [51] Sarah Schömbs, Yan Zhang, Jorge Gonçalves, and Wafa Johal. 2025. From Conversation to Orchestration: HCI Challenges and Opportunities in Interactive Multi-Agentic Systems. CoRR abs/2506.20091 (2025). doi:10.48550/ARXIV.2506.20091.
- [52] Valentin Schwind, Stefan Resch, and Jessica Sehrt. 2023. The HCI User Studies Toolkit: Supporting Study Designing and Planning for Undergraduates and Novice Researchers in Human-Computer Interaction. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, 1–7.
- [53] Scientific American. 2026. The Her Talking Phone. https://www.scientificamerican.com/article/bytedance-launches-doubao-real-time-ai-voice-assistant-for-phones/
- [54] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina N Toutanova. 2023. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. Advances in Neural Information Processing Systems 36 (2023), 34354–34370.
- [55] Maayan Shvo, Zhiming Hu, Rodrigo Toro Icarte, Iqbal Mohomed, Allan D Jepson, and Sheila A McIlraith. 2021. AppBuddy: Learning to Accomplish Tasks in Mobile Apps via Reinforcement Learning. In Canadian AI.
- [56] Yunpeng Song, Yiheng Bian, Yongtao Tang, Guiyu Ma, and Zhongmin Cai. 2024. VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24), ACM, 1–17.
- [58] Sonix. 2024. A Deep Dive into the History of Speech Recognition Technology. https://sonix.ai/history-of-speech-recognition Last updated: June 25, 2024.
- [60] Songyan Teng, Simon D'Alfonso, and Vassilis Kostakos. 2024. A Tool for Capturing Smartphone Screen Text. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, ACM, 938:1–938:24. doi:10.1145/3613904.3642347.
- [62] W3C. 2015. Mobile Accessibility: How WCAG 2.0 and Other W3C/WAI Guidelines Apply to Mobile. https://www.w3.org/TR/mobile-accessibility-mapping/. Accessed on August 3, 2025.
- [63] Bryan Wang, Gang Li, and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI using Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, ACM, 432:1–432:17. doi:10.1145/3544548.3580895.
- [64] Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. In Advances in Neural Information Processing Systems 38, NeurIPS 2024.
- [66] Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, and Shoufa Chen. 2024. MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents. CoRR abs/2406.08184 (2024). doi:10.48550/ARXIV.2406.08184.
- [67] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345.
- [68] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. AutoDroid: LLM-powered task automation in Android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 543–557.
- [69] Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2023. DroidBot-GPT: GPT-powered UI Automation for Android. CoRR abs/2304.07061 (2023). doi:10.48550/ARXIV.2304.07061.
- [70] Lilian Weng. 2023. LLM-powered Autonomous Agents. lilianweng.github.io (Jun 2023). https://lilianweng.github.io/posts/2023-06-23-agent/
- [71] Biao Wu, Yanda Li, Meng Fang, Zirui Song, Zhiwei Zhang, Yunchao Wei, and Ling Chen. 2024. Foundations and Recent Trends in Multimodal Mobile Agents: A Survey. CoRR abs/2411.02006 (2024). https://doi.org/10.48550/arXiv.2411.02006
- [72] Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, and Liang He. 2022. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems 135 (2022), 364–381.
- [73] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems 37 (2024), 52040–52094.
- [77] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2025. AppAgent: Multimodal Agents as Smartphone Users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025, Yokohama, Japan, ACM, 70:1–70:20. doi:10.1145/3706598.3713600.
- [78] Shiquan Zhang, Ying Ma, Le Fang, Hong Jia, Simon D'Alfonso, and Vassilis Kostakos. 2024. Enabling on-device LLMs personalization with smartphone sensing. In Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 186–190.
- [80] Tianyi Zhang, Shiquan Zhang, Le Fang, Hong Jia, Vassilis Kostakos, and Simon D'Alfonso. 2024. AutoJournaling: A Context-Aware Journaling System Leveraging MLLMs on Smartphone Screenshots. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2347–2352.
discussion (0)