pith. machine review for the scientific record.

arxiv: 2604.17817 · v1 · submitted 2026-04-20 · 💻 cs.HC · cs.AI · cs.MA

Recognition: unknown

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 04:39 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.MA
keywords LLM agents · smartphone automation · benchmark · failure analysis · UI accessibility · multimodal inputs · screentext · mobile agents

The pith

Text-only inputs work nearly as well as screenshots for LLM smartphone agents, but UI accessibility flaws cause most failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds DailyDroid, a benchmark of 75 everyday tasks across 25 Android apps and three difficulty levels, to test LLM phone automation. It compares text-only screen descriptions against full screenshots on GPT-4o and o4-mini, finding similar success rates with only small gains from images. Failure analysis then shows that apps often fail to expose readable text, that layouts confuse the models, and that design gaps between LLMs and apps block reliable performance. This matters because cheaper text-based agents become viable once those barriers are fixed; as things stand, phones and apps are not ready for dependable automation.

Core claim

LLM-driven smartphone automation achieves comparable success rates with screentext inputs alone versus multimodal inputs that add screenshots, while the dominant problems stem from insufficient UI accessibility, limitations in input modalities, and mismatches between LLM expectations and app designs.

What carries the argument

The DailyDroid benchmark of 75 tasks spanning five scenarios and three difficulty levels, used to run controlled comparisons of text-only versus text-plus-screenshot performance across 300 trials.
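
The comparison reduces to a paired evaluation loop over one task set under two input conditions. A minimal sketch of that loop in Python, assuming a hypothetical run_trial driver that executes a single task with a given model and modality and reports success; the task schema and function names are illustrative, not the paper's actual harness.

# Minimal sketch of the paired modality comparison.
# run_trial is a hypothetical stand-in for the real agent harness:
# it would drive one task in an Android environment and report success.
from dataclasses import dataclass

@dataclass
class Task:
    app: str
    instruction: str
    difficulty: str  # e.g. "easy", "medium", "hard"

def run_trial(model: str, task: Task, use_screenshot: bool) -> bool:
    raise NotImplementedError  # placeholder for the real harness

def compare_modalities(models: list[str], tasks: list[Task]) -> dict:
    rates = {}
    for model in models:
        for use_screenshot in (False, True):  # text-only vs. text+screenshot
            key = (model, "multimodal" if use_screenshot else "text-only")
            wins = sum(run_trial(model, t, use_screenshot) for t in tasks)
            rates[key] = wins / len(tasks)  # success rate per condition
    return rates

# 2 models x 2 modalities x 75 tasks = 300 trials, matching the study's count.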

Load-bearing premise

That the 75 tasks in DailyDroid sufficiently represent real-world LLM-driven smartphone automation challenges, and that the observed failures generalize beyond the tested models and apps.

What would settle it

A larger study with more diverse real-user tasks or additional LLMs that finds substantially higher success rates from multimodal inputs than from text alone would disprove the comparability result.

Figures

Figures reproduced from arXiv: 2604.17817 by Hong Jia, Le Fang, Shiquan Zhang, Simon D'Alfonso, Tianyi Zhang, Vassilis Kostakos.

Figure 1. An overview of the Mobile Agent System. It depicts the interaction between the user, the smartphone environment, and the …
Figure 2. Comparison of screen representations. (A) Simplified HTML showing the structured text representation of the Google Maps …
Figure 3. Failed cases of the text-only modality. (A) In Google Play Books, the red rectangle highlights the reading progress. (B) In …
Figure 4. Example of incorrect UI extraction in an emulator. (A) The simplified HTML output shows only limited elements, missing most …
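
Figure 2's simplified HTML and Figure 4's extraction failures both concern the step that flattens a UI tree into screentext. A minimal sketch of one common approach in Python, assuming input in the XML format produced by Android's uiautomator dump tool; the tag mapping and filtering rules are illustrative, not the paper's pipeline.

# Sketch: flatten a uiautomator XML dump into simplified-HTML screentext.
# Nodes with no text, content description, or interactivity are dropped --
# exactly the step where "incorrect UI extraction" failures arise when an
# app draws custom widgets without accessibility labels.
import xml.etree.ElementTree as ET

def node_to_line(node: ET.Element, idx: int) -> str | None:
    text = node.get("text") or node.get("content-desc") or ""
    clickable = node.get("clickable") == "true"
    if not text and not clickable:
        return None  # nothing here an LLM could read or act on
    tag = "button" if clickable else "p"
    return f'<{tag} id="{idx}">{text}</{tag}>'

def dump_to_screentext(xml_path: str) -> str:
    root = ET.parse(xml_path).getroot()
    lines = (node_to_line(n, i) for i, n in enumerate(root.iter("node")))
    return "\n".join(l for l in lines if l)

Run against a text-sparse screen (an image-heavy reader view, say), this yields only a handful of lines, which is the text-only failure mode Figure 3 illustrates.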
read the original abstract

With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DailyDroid, a benchmark of 75 tasks across five scenarios in 25 Android apps at three difficulty levels, to study LLM-driven smartphone automation. It evaluates text-only versus multimodal (text + screenshot) inputs on GPT-4o and o4-mini over 300 trials, reporting comparable performance with multimodal inputs showing marginally higher success rates. The work also presents an in-depth failure analysis that compiles a handbook of common failures, highlighting issues in UI accessibility, input modalities, and LLM/app design.

Significance. If the benchmark tasks prove representative, the work offers a practical empirical foundation for understanding failure modes in mobile LLM agents and supplies a reusable failure taxonomy that could directly inform improvements in agent prompting, app UI design, and accessibility features. The focus on everyday scenarios and the explicit comparison of input modalities are timely given the rapid deployment of such agents.

major comments (3)
  1. [§3] Benchmark Construction: The 75 tasks, five scenarios, and three difficulty levels are presented as mimicking everyday smartphone use, yet the manuscript provides no quantitative sampling justification, coverage metrics (e.g., fraction of permission dialogs, background services, or cross-app handoffs), or comparison against app-store usage statistics. This directly affects the load-bearing claim that the observed performance parity and the compiled failure handbook generalize beyond the tested set.
  2. [Abstract, §5] Evaluation: The central result that text-only and multimodal inputs yield 'comparable performance' with multimodal 'marginally higher success rates' is stated without any numerical success rates, per-scenario or per-model breakdowns, error bars, or statistical significance tests. Without these quantities the magnitude and reliability of the reported parity cannot be assessed [a bootstrap sketch appears after the minor comments].
  3. [§6] Failure Analysis: The handbook of common failures is derived from the 300 trials, but the manuscript does not describe how the taxonomy was constructed, the distribution of failures across categories, or any inter-annotator agreement procedure. This limits the precision and reproducibility of the identified 'critical issues' in UI accessibility and input modalities.
minor comments (2)
  1. [§4] The prompt templates and exact screenshot encoding used for multimodal inputs are not shown; including them (perhaps in an appendix) would improve reproducibility.
  2. [Throughout] A small number of typographical inconsistencies appear in the scenario descriptions; a final proofreading pass is recommended.
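
On the error-bar point in major comment 2, a percentile bootstrap over per-task outcomes is the simplest interval the authors could report. A minimal sketch in Python with hypothetical outcome vectors; the numbers are not drawn from the paper.

# Sketch: percentile bootstrap confidence interval on a success rate.
import random

def bootstrap_ci(outcomes: list[bool], iters: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    n = len(outcomes)
    # resample tasks with replacement, recompute the success rate each time
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(iters))
    return rates[int(iters * alpha / 2)], rates[int(iters * (1 - alpha / 2)) - 1]

# e.g. bootstrap_ci([True] * 52 + [False] * 23) for a hypothetical 52/75 rate

With only 75 tasks per condition such intervals are wide, which is why the report asks for them before reading much into a 'marginal' gap.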

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have made revisions to improve clarity, rigor, and reproducibility.

read point-by-point responses
  1. Referee: [§3] Benchmark Construction: The 75 tasks, five scenarios, and three difficulty levels are presented as mimicking everyday smartphone use, yet the manuscript provides no quantitative sampling justification, coverage metrics (e.g., fraction of permission dialogs, background services, or cross-app handoffs), or comparison against app-store usage statistics. This directly affects the load-bearing claim that the observed performance parity and the compiled failure handbook generalize beyond the tested set.

    Authors: We acknowledge the absence of explicit quantitative sampling justification and coverage metrics in the original manuscript. Task selection was guided by common everyday smartphone interactions drawn from prior HCI literature and pilot testing, but we agree this should be documented more rigorously. In the revised manuscript, we will add a dedicated subsection in §3 describing the curation process, including rationale for the five scenarios and three difficulty levels, along with available coverage metrics from our task set (e.g., presence of permission flows and cross-app elements). We will also explicitly discuss limitations regarding generalizability to broader app-store distributions, as we did not perform a full statistical comparison against usage logs. revision: yes

  2. Referee: [Abstract, §5] Evaluation: The central result that text-only and multimodal inputs yield 'comparable performance' with multimodal 'marginally higher success rates' is stated without any numerical success rates, per-scenario or per-model breakdowns, error bars, or statistical significance tests. Without these quantities the magnitude and reliability of the reported parity cannot be assessed.

    Authors: We appreciate this observation. While detailed results, including per-model and per-scenario success rates, appear in tables and figures within §5, the abstract and the high-level summary in §5 did not include the specific numerical values or statistical details. In the revision, we will update the abstract to report key aggregate success rates (text-only vs. multimodal) and add explicit per-scenario breakdowns, error bars, and statistical significance tests (e.g., McNemar's test or similar) to §5 to substantiate the claims of comparability and marginal improvement [a sketch of such a test appears after these responses]. revision: yes

  3. Referee: [§6] Failure Analysis: The handbook of common failures is derived from the 300 trials, but the manuscript does not describe how the taxonomy was constructed, the distribution of failures across categories, or any inter-annotator agreement procedure. This limits the precision and reproducibility of the identified 'critical issues' in UI accessibility and input modalities.

    Authors: We agree that the taxonomy construction process requires more detail. The categories were derived through iterative qualitative analysis of all 300 trial logs and failure cases by the author team, focusing on observable patterns in UI accessibility, reasoning, and modality issues. In the revised §6, we will include: (1) a step-by-step description of the taxonomy development, (2) the distribution of failures across categories with counts or percentages, and (3) clarification of the annotation procedure. As the analysis was performed internally by the core team without multiple independent annotators, we will note the absence of formal inter-annotator agreement metrics as a limitation while emphasizing the systematic categorization approach used [an agreement-metric sketch appears after these responses]. revision: partial
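
Two of the promised additions are compact enough to sketch in Python. For response 2, an exact McNemar test compares paired per-task outcomes using only the discordant pairs; for response 3, Cohen's kappa would quantify agreement if a second annotator is added. Both sketches use hypothetical inputs; neither is taken from the paper.

# Sketch: exact McNemar test on paired per-task outcomes.
# b = tasks only the multimodal run solved, c = tasks only text-only solved;
# under H0 (no modality effect) the discordant tasks split 50/50.
from math import comb

def mcnemar_exact(text_only: list[bool], multimodal: list[bool]) -> float:
    b = sum(m and not t for t, m in zip(text_only, multimodal))
    c = sum(t and not m for t, m in zip(text_only, multimodal))
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a modality effect
    # two-sided exact binomial p-value, capped at 1
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n)

# Sketch: Cohen's kappa over two annotators' failure-category labels,
# e.g. drawn from a hypothetical taxonomy like
# ["ui-accessibility", "modality", "llm-reasoning", ...].
from collections import Counter

def cohens_kappa(ann_a: list[str], ann_b: list[str]) -> float:
    n = len(ann_a)
    p_obs = sum(x == y for x, y in zip(ann_a, ann_b)) / n  # raw agreement
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_exp = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n ** 2  # chance
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)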

Circularity Check

0 steps flagged

No significant circularity in this empirical benchmark study

full rationale

The paper introduces the DailyDroid benchmark of 75 tasks, runs direct trials on GPT-4o and o4-mini using text-only and multimodal inputs, reports success rates, and compiles a failure handbook from observed outcomes. No equations, derivations, fitted parameters, or predictions appear; results derive from external model executions on the defined tasks rather than any reduction to the paper's own inputs or self-citations. The central claims rest on empirical measurement, not on any self-referential chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation study and relies on standard domain assumptions about task representativeness and generalizability of failures rather than new mathematical constructs.

axioms (1)
  • domain assumption The 75 tasks across 25 apps and three difficulty levels adequately mimic everyday smartphone automation challenges.
    Invoked in the benchmark design description to justify evaluation scope.

pith-pipeline@v0.9.0 · 5489 in / 1354 out tokens · 45920 ms · 2026-05-10T04:39:09.480908+00:00 · methodology

