
arxiv: 2505.03364 · v2 · submitted 2025-05-06 · 💻 cs.HC

DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking

Pith reviewed 2026-05-22 16:57 UTC · model grok-4.3

classification 💻 cs.HC
keywords mobile agents · information seeking · transparent automation · steerable systems · multi-LLM pipeline · progress dashboard · cross-app navigation · user intervention

The pith

DroidRetriever uses a multi-LLM pipeline and live dashboard to let users monitor, steer, and intervene in cross-app mobile searches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DroidRetriever as a system that accepts a query, breaks it into sub-tasks with language models, navigates apps, captures screenshots, and assembles a report while showing the entire process to the user. A dashboard displays sub-task status alongside maps of explored content so people can take over at any moment or approve actions on private screens. The approach is evaluated on 35 tasks spanning 24 apps, where it produced higher coverage, clearer visibility into the work, and lower effort than prior mobile agents. If the core mechanisms hold, fragmented mobile information gathering could shift from repeated context switches and manual re-entry to a guided, interruptible collaboration between user and automation.

Core claim

DroidRetriever accepts voice or typed input and uses a multi-LLM system to decompose the task, navigate to target pages, take screenshots, and synthesize a concise report with citation-linked screenshots. Transparency comes from a progress dashboard that combines sub-task status with real-time exploration maps, so the user can take over at any point. The system also pauses on detected privacy or high-risk screens and prompts the user to intervene.
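
The flow in this claim (decompose, navigate, capture, pause on sensitive screens, synthesize) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: `Screen`, `Evidence`, `decompose`, `run`, and `synthesize` are hypothetical names, the LLM calls are replaced by trivial stand-ins, and the `sensitive` flag stands in for whatever privacy or high-risk detector the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Screen:
    app: str
    text: str
    sensitive: bool = False  # stand-in for a privacy / high-risk detector (assumption)


@dataclass
class Evidence:
    sub_task: str
    screenshot_id: int  # citation number used in the final report
    text: str


def decompose(query: str) -> List[str]:
    # Stand-in for LLM task decomposition: one sub-task per " and "-separated clause.
    return [part.strip() for part in query.split(" and ") if part.strip()]


def run(
    query: str,
    navigate: Callable[[str], Screen],
    ask_user: Callable[[Screen], bool],
) -> List[Evidence]:
    """Visit one screen per sub-task, pausing for approval on sensitive screens."""
    evidence: List[Evidence] = []
    for sub_task in decompose(query):
        screen = navigate(sub_task)
        # Pause point: a sensitive screen is only captured if the user approves.
        if screen.sensitive and not ask_user(screen):
            continue
        evidence.append(Evidence(sub_task, len(evidence) + 1, screen.text))
    return evidence


def synthesize(evidence: List[Evidence]) -> str:
    # Each report line ends with a marker that links back to its screenshot.
    return "\n".join(f"{item.text} [{item.screenshot_id}]" for item in evidence)
```

Passing `navigate` and `ask_user` as callables keeps the control flow testable without a device: a test can supply canned screens and an approval function that always declines.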

What carries the argument

The progress dashboard that merges sub-task progress indicators with real-time exploration maps, backed by the multi-LLM pipeline for task decomposition, navigation, screenshot capture, and report synthesis.
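
As a rough illustration of what such a dashboard has to track, the sketch below keeps sub-task status and an exploration map in one state object. The class and method names (`Dashboard`, `Status`, `record_visit`, and so on) are assumptions made for this example; the paper's actual data model is not described on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"  # waiting for the user to intervene or approve
    DONE = "done"


@dataclass
class Dashboard:
    # Sub-task name -> current status (insertion order is preserved).
    sub_tasks: Dict[str, Status] = field(default_factory=dict)
    # Exploration map: page -> pages reached directly from it.
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_sub_task(self, name: str) -> None:
        self.sub_tasks[name] = Status.PENDING

    def set_status(self, name: str, status: Status) -> None:
        if name not in self.sub_tasks:
            raise KeyError(name)
        self.sub_tasks[name] = status

    def record_visit(self, source: str, target: str) -> None:
        self.edges.setdefault(source, set()).add(target)
        self.edges.setdefault(target, set())

    def progress(self) -> float:
        """Fraction of sub-tasks that are done (0.0 when there are none)."""
        if not self.sub_tasks:
            return 0.0
        done = sum(1 for status in self.sub_tasks.values() if status is Status.DONE)
        return done / len(self.sub_tasks)

    def render(self) -> str:
        """Plain-text view: sub-task statuses first, then map edges."""
        lines = [f"[{status.value}] {name}" for name, status in self.sub_tasks.items()]
        for source in sorted(self.edges):
            for target in sorted(self.edges[source]):
                lines.append(f"{source} -> {target}")
        return "\n".join(lines)
```

Keeping both views in one object means a takeover can be recorded as a status change (`PAUSED`) without losing the map of what has already been explored.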

If this is right

  • Final reports include citation-linked screenshots that let users verify each piece of information against its source screen.
  • The system pauses automatically before displaying or acting on privacy-sensitive or high-risk content.
  • Users avoid repetitive context switching and data re-entry because the automation maintains state across apps.
  • Overall coverage rises as the system systematically explores multiple sources while the dashboard keeps the user informed.
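
The first point, verifying report statements against their source screens, can be checked mechanically. The sketch below assumes citations are written as bracketed integers such as `[3]` and that the report holds one claim per line; both conventions are assumptions for illustration, not the paper's report format, and the function names are hypothetical.

```python
import re
from typing import Dict, List

CITATION = re.compile(r"\[(\d+)\]")


def unresolved_citations(report: str, screenshots: Dict[int, str]) -> List[int]:
    """Return cited screenshot ids that have no stored screenshot."""
    cited = {int(num) for num in CITATION.findall(report)}
    return sorted(cited - set(screenshots))


def uncited_claims(report: str) -> List[str]:
    """Return non-empty report lines that carry no citation marker."""
    lines = [line.strip() for line in report.splitlines()]
    return [line for line in lines if line and not CITATION.search(line)]
```

A report passes this check when both functions return empty lists: every marker resolves to a screenshot, and no claim is left without a source.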

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combination of visible maps and pause points could reduce opacity in automation tools built for desktop or web environments.
  • Repeated successful handoffs between agent and user may encourage designers to add explicit takeover features to other personal-data agents.
  • Over time the dashboard data could reveal common intervention patterns that inform better default behaviors for future versions.

Load-bearing premise

The multi-LLM system can reliably decompose queries, navigate through diverse apps, and produce accurate reports without frequent errors or getting stuck.

What would settle it

If evaluation on the 35 tasks shows frequent navigation failures, incomplete reports, or no measurable drop in user workload and context switching, the claimed improvements would not hold.

Figures

Figures reproduced from arXiv: 2505.03364 by Guiyu Ma, Rongrong Zhu, Yiheng Bian, Yunpeng Song, Zhongmin Cai.

Figure 1: Comparison between DroidRetriever (left) and the general-purpose LLM-driven agent (right). DroidRetriever consists…
Figure 2: The Automated Workflow of DroidRetriever (excluding manual intervention). It includes 3 modules: task decomposi…
Figure 3: Overview of Page-level decomposition, showing focused mode, list-view mode, and multi-page mode.
Figure 4: Intervention mechanisms: Intervention_a requires gesture operations during intervention, such as tapping and text input, and also includes intervention for proactive alerts on privacy-sensitive operations and high-risk actions. Intervention_b lets the user take a screenshot and save the current interface to the search results database. Intervention_c signifies the intention to terminate the UI copilot…
Figure 5: User interfaces: (a) shows the intervention widget: (a-1) intervention - interrupt and take over, (a-2) tap to return…
Figure 6: Results of Study 1: (a) Coverage, accuracy, and redundancy rates for manual vs. system-generated reports; (b) Overall quality ratings. ↓ indicates lower is better. *** indicates a significant difference with p < .001, while ** indicates significance with p < .01.
Figure 7: Results of Study 2, including task decomposition and a comparative evaluation of Human, DroidRetriever, LLM-driven search engines (Qwen & ChatGPT), Claude Computer Use, and Mobile-Agent-v2. (a) Page-level decomposition confusion matrix. (b) Ratio of user-intervention time to total task duration for four intervention types and overall. (c) Task-wise and Step-wise Intervention Rates for DroidRetriever. (d-e) …
Figure 8: Illustration of scrolling screenshot.
Original abstract

Information seeking on mobile devices is often fragmented, trapping users in repetitive cycles of context switching and data re-entry, which increases cognitive load and disrupts workflow. Existing mobile agents provide limited cross-source integration and are largely opaque, presenting progress as a linear feed with few opportunities to intervene, steer, or take control. We present DroidRetriever, a transparent, steerable system for cross-source mobile information seeking. It accepts voice or typed input and the multi-LLM system decomposes the task, navigates to target pages, takes screenshots, and synthesizes a concise report with citation-linked screenshots. We make the process transparent through a progress dashboard combining sub-task progress and real-time exploration maps for seamless takeover. DroidRetriever also pauses on detected privacy or high-risk screens and prompts intervention. Across 35 tasks over 24 apps, experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload. We release our code at https://github.com/AkimotoAyako/DroidRetriever.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DroidRetriever, a transparent and steerable automation system for collaborative mobile information seeking. It uses a multi-LLM pipeline to accept voice or typed input, decompose tasks, navigate target pages in mobile apps, capture screenshots, and synthesize concise reports with citation-linked screenshots. Transparency is achieved via a progress dashboard that combines sub-task progress with real-time exploration maps, enabling seamless user takeover, while the system pauses on detected privacy or high-risk screens for intervention. The authors report that experiments and user studies across 35 tasks over 24 apps demonstrate improvements in coverage, transparency, and reduced workload, and they release the code publicly.

Significance. If the empirical claims hold after detailed reporting, the work could advance HCI research on mobile agents by demonstrating a practical balance between automation and user steerability in cross-app information seeking. The public code release is a clear strength that supports reproducibility. The approach addresses real workflow fragmentation but its impact depends on substantiating the reliability of the multi-LLM components across variable UIs.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The claim that 'experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload' across 35 tasks over 24 apps supplies no quantitative metrics, baselines, task selection criteria, statistical tests, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed and the central empirical contribution remains unevaluable.
  2. [System Architecture] System Architecture section: The multi-LLM pipeline is described as decomposing queries, navigating pages, and synthesizing reports, yet no mechanisms are specified for detecting or recovering from navigation dead-ends, OCR errors on dynamic screens, or privacy-screen misclassifications. Given that mobile UIs vary rapidly across apps, this omission is load-bearing for the claims of autonomous coverage improvements and reduced workload.
minor comments (2)
  1. [System Design] The progress dashboard description would benefit from additional detail on how exploration maps are rendered in real time and how user interventions are logged for later analysis.
  2. [Discussion] Consider adding a limitations subsection that explicitly discusses failure rates observed during the 35-task evaluation and the frequency of required user takeovers.
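
Two of the quantities requested in these comments, a coverage rate and a paired comparison against a baseline, can be made concrete with a small sketch. It assumes coverage means the fraction of expected information items that appear in a report, and it computes only the paired t statistic; converting that to a p-value needs a t-distribution table or a statistics library. The function names are hypothetical and nothing here is taken from the paper's evaluation code.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Set


def coverage_rate(retrieved: Set[str], expected: Set[str]) -> float:
    """Fraction of expected information items that were actually retrieved."""
    if not expected:
        raise ValueError("expected must not be empty")
    return len(retrieved & expected) / len(expected)


def paired_t_statistic(system: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for per-task paired differences (system minus baseline)."""
    if len(system) != len(baseline) or len(system) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [s - b for s, b in zip(system, baseline)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs)))
```

Pairing by task matters here because the same 35 tasks would be run under each condition; an unpaired test would ignore that structure.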

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our empirical results and system robustness. We address each major point below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The claim that 'experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload' across 35 tasks over 24 apps supplies no quantitative metrics, baselines, task selection criteria, statistical tests, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed and the central empirical contribution remains unevaluable.

    Authors: We agree that the abstract and the high-level summary in the Evaluation section lack the requested quantitative detail. In the revised manuscript we will expand the Evaluation section to report: coverage rates (percentage of information items successfully retrieved); success rates across the 35 tasks; workload reduction measured with NASA-TLX scores; baseline comparisons against manual search and a prior mobile agent; explicit task selection criteria (tasks sampled from productivity, research, and shopping scenarios across the 24 apps); paired statistical tests with p-values; and a categorized error analysis of failure cases. These additions will make the magnitude and reliability of the gains directly assessable.
    revision: yes

  2. Referee: [System Architecture] System Architecture section: The multi-LLM pipeline is described as decomposing queries, navigating pages, and synthesizing reports, yet no mechanisms are specified for detecting or recovering from navigation dead-ends, OCR errors on dynamic screens, or privacy-screen misclassifications. Given that mobile UIs vary rapidly across apps, this omission is load-bearing for the claims of autonomous coverage improvements and reduced workload.

    Authors: We acknowledge that the nominal pipeline description omits explicit error-handling mechanisms. In the revision we will add a dedicated subsection detailing: (1) navigation dead-end detection via LLM-based page-state verification followed by backtracking or alternative path selection; (2) OCR error mitigation through multi-frame screenshot capture and cross-verification with the LLM; and (3) privacy-screen classification using an ensemble of vision-language models with confidence thresholding and fallback to user prompts. These mechanisms directly support the reliability claims for autonomous coverage and workload reduction in variable mobile UIs.
    revision: yes
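
Mechanism (3), an ensemble with confidence thresholding and a fallback to the user, could look roughly like the sketch below. It assumes each model returns a probability that the screen is privacy-sensitive and that the ensemble simply averages them; the averaging rule, the default threshold of 0.8, and the names `classify_screen` and `Decision` are all assumptions for illustration, since the rebuttal does not specify them.

```python
from enum import Enum
from statistics import mean
from typing import Sequence


class Decision(Enum):
    PAUSE = "pause"        # confidently sensitive: stop and hand over
    PROCEED = "proceed"    # confidently safe: continue automatically
    ASK_USER = "ask_user"  # ensemble is unsure: fall back to a user prompt


def classify_screen(scores: Sequence[float], threshold: float = 0.8) -> Decision:
    """Combine per-model sensitivity probabilities into one decision.

    scores: one probability in [0, 1] per model in the ensemble.
    threshold: confidence required to act without asking the user.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must be probabilities in [0, 1]")
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1]")
    probability = mean(scores)
    if probability >= threshold:
        return Decision.PAUSE
    if probability <= 1.0 - threshold:
        return Decision.PROCEED
    return Decision.ASK_USER
```

The band between `1 - threshold` and `threshold` is where the models disagree or are individually unsure; routing that band to the user trades a few extra prompts for fewer silent misclassifications.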

Circularity Check

0 steps flagged

No circularity: engineering system description with no derivations or fitted predictions

full rationale

The paper presents DroidRetriever as a multi-LLM mobile automation system with a progress dashboard, evaluated on 35 tasks across 24 apps for coverage, transparency, and workload. No equations, parameters, or predictions appear in the abstract or described architecture. The contribution is an implemented system plus user studies rather than a derivation chain; nothing reduces by construction to prior fitted values or self-citations. This is the expected non-finding for a systems paper whose claims rest on empirical demonstration rather than closed-form reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The system rests on domain assumptions about LLM reliability for task decomposition and mobile navigation plus user willingness to intervene via the dashboard; no free parameters or new physical entities are introduced.

axioms (1)
  • Domain assumption: Multi-LLM agents can accurately decompose information-seeking tasks and navigate mobile app interfaces across diverse apps.
    Invoked in the description of task decomposition, navigation, and report synthesis.
invented entities (1)
  • Progress dashboard combining sub-task progress and real-time exploration maps (no independent evidence)
    Purpose: Provide transparency and enable seamless user takeover.
    New interface component introduced to address opacity of existing agents.

pith-pipeline@v0.9.0 · 5714 in / 1395 out tokens · 53116 ms · 2026-05-22T16:57:56.653766+00:00 · methodology

