DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking

Guiyu Ma; Rongrong Zhu; Yiheng Bian; Yunpeng Song; Zhongmin Cai

arxiv: 2505.03364 · v2 · submitted 2025-05-06 · 💻 cs.HC

DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking

Yiheng Bian , Yunpeng Song , Guiyu Ma , Rongrong Zhu , Zhongmin Cai This is my paper

Pith reviewed 2026-05-22 16:57 UTC · model grok-4.3

classification 💻 cs.HC

keywords mobile agentsinformation seekingtransparent automationsteerable systemsmulti-LLM pipelineprogress dashboardcross-app navigationuser intervention

0 comments

The pith

DroidRetriever uses a multi-LLM pipeline and live dashboard to let users monitor, steer, and intervene in cross-app mobile searches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DroidRetriever as a system that accepts a query, breaks it into sub-tasks with language models, navigates apps, captures screenshots, and assembles a report while showing the entire process to the user. A dashboard displays sub-task status alongside maps of explored content so people can take over at any moment or approve actions on private screens. The approach is evaluated on 35 tasks spanning 24 apps, where it produced higher coverage, clearer visibility into the work, and lower effort than prior mobile agents. If the core mechanisms hold, fragmented mobile information gathering could shift from repeated context switches and manual re-entry to a guided, interruptible collaboration between user and automation.

Core claim

DroidRetriever accepts voice or typed input and employs a multi-LLM system to decompose tasks, navigate target pages, take screenshots, and synthesize concise reports with citation-linked screenshots; transparency is achieved through a progress dashboard that combines sub-task status with real-time exploration maps, allowing seamless user takeover, while the system pauses on detected privacy or high-risk screens to prompt intervention.

What carries the argument

The progress dashboard that merges sub-task progress indicators with real-time exploration maps, backed by the multi-LLM pipeline for task decomposition, navigation, screenshot capture, and report synthesis.

If this is right

Final reports include citation-linked screenshots that let users verify each piece of information against its source screen.
The system pauses automatically before displaying or acting on privacy-sensitive or high-risk content.
Users avoid repetitive context switching and data re-entry because the automation maintains state across apps.
Overall coverage rises as the system systematically explores multiple sources while the dashboard keeps the user informed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same combination of visible maps and pause points could reduce opacity in automation tools built for desktop or web environments.
Repeated successful handoffs between agent and user may encourage designers to add explicit takeover features to other personal-data agents.
Over time the dashboard data could reveal common intervention patterns that inform better default behaviors for future versions.

Load-bearing premise

The multi-LLM system can reliably decompose queries, navigate through diverse apps, and produce accurate reports without frequent errors or getting stuck.

What would settle it

If evaluation on the 35 tasks shows frequent navigation failures, incomplete reports, or no measurable drop in user workload and context switching, the claimed improvements would not hold.

Figures

Figures reproduced from arXiv: 2505.03364 by Guiyu Ma, Rongrong Zhu, Yiheng Bian, Yunpeng Song, Zhongmin Cai.

**Figure 1.** Figure 1: Comparison between DroidRetriever (left) and the general-purpose LLM-driven agent (right). DroidRetriever consists [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The Automated Workflow of DroidRetriever (excluding manual intervention). It includes 3 modules: task decomposi [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of Page-level decomposition, showing focused mode, list-view mode, and multi-page mode. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Intervention mechanisms: 𝐼𝑛𝑡𝑒𝑟𝑣𝑒𝑛𝑡𝑖𝑜𝑛𝑎 requires gesture operations during intervention, such as tapping and text input, and also includes intervention for proactive alerts on privacy-sensitive operations and high-risk actions. 𝐼𝑛𝑡𝑒𝑟𝑣𝑒𝑛𝑡𝑖𝑜𝑛𝑏 lets the user take a screenshot and save the current interface to the search results database. 𝐼𝑛𝑡𝑒𝑟𝑣𝑒𝑛𝑡𝑖𝑜𝑛𝑐 signifies the intention to terminate the UI copilot [PITH_… view at source ↗

**Figure 5.** Figure 5: User interfaces: (a) shows the intervention widget: (a-1) intervention - interrupt and take over, (a-2) tap to return [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Results of Study 1: (a) Coverage, accuracy, and redundancy rates for manual vs. system-generated reports; (b) Overall quality ratings. ↓ indicates lower is better. *** indicates a significant difference with 𝑝 < .001, while ** indicates significance with 𝑝 < .01. formal report, and could record in any format to reflect real-life "capture-and-notes" habits. The platform supported copying any content from th… view at source ↗

**Figure 7.** Figure 7: Results of Study 2, including task decomposition and a comparative evaluation of Human, DroidRetriever, LLM-driven search engines (Qwen & ChatGPT), Claude Computer Use, and Mobile-Agent-v2. (a) Page-level decomposition confusion matrix. (b) Ratio of user-intervention time to total task duration for four intervention types and overall. (c) Task-wise and Step-wise Intervention Rates for DroidRetriever.(d-e) … view at source ↗

**Figure 8.** Figure 8: Illustration of scrolling screenshot. As shown in [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

Information seeking on mobile devices is often fragmented, trapping users in repetitive cycles of context switching and data re-entry, which increases cognitive load and disrupts workflow. Existing mobile agents provide limited cross-source integration and are largely opaque, presenting progress as a linear feed with few opportunities to intervene, steer, or take control. We present DroidRetriever, a transparent, steerable system for cross-source mobile information seeking. It accepts voice or typed input and the multi-LLM system decomposes the task, navigates to target pages, takes screenshots, and synthesizes a concise report with citation-linked screenshots. We make the process transparent through a progress dashboard combining sub-task progress and real-time exploration maps for seamless takeover. DroidRetriever also pauses on detected privacy or high-risk screens and prompts intervention. Across 35 tasks over 24 apps, experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload. We release our code at https://github.com/AkimotoAyako/DroidRetriever.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DroidRetriever describes a multi-LLM mobile agent with a steerable dashboard and privacy pauses, but the evaluation supplies no numbers or failure analysis to back the claimed gains.

read the letter

The main thing here is a concrete system for mobile information tasks that breaks queries into sub-tasks via multiple LLMs, navigates apps, pulls screenshots, and builds a report with linked images. It adds a dashboard showing progress and an exploration map so users can jump in, plus automatic pauses on private screens. Code is out on GitHub, which is useful for anyone wanting to try the architecture. That combination of decomposition, real-time visibility, and intervention points is the clearest addition over earlier linear mobile agents. The authors target a real daily friction—switching between apps and re-entering data—and the steerable design could cut some of that load if it works as described. The abstract says experiments and user studies across 35 tasks in 24 apps showed better coverage, transparency, and lower workload, but it gives no actual metrics, baselines, statistical tests, or error rates. Without those, the improvements stay hard to judge. The stress-test note flags the missing reliability data, and that lands: mobile UIs shift often, OCR on screenshots can fail, and nothing shows how the system recovers from wrong navigation or bad synthesis. If users end up intervening a lot, the workload reduction claim weakens. The paper is an engineering description rather than a closed theoretical result, so the circularity burden is low but the evidence bar is also low. This is mainly for HCI people building or studying mobile agents who want a working example with dashboard ideas. A reader looking for solid quantitative backing or robustness tests will come away wanting more. It deserves peer review once the authors add the missing numbers and failure analysis; the core idea is practical enough to be worth referee time.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DroidRetriever, a transparent and steerable automation system for collaborative mobile information seeking. It uses a multi-LLM pipeline to accept voice or typed input, decompose tasks, navigate target pages in mobile apps, capture screenshots, and synthesize concise reports with citation-linked screenshots. Transparency is achieved via a progress dashboard that combines sub-task progress with real-time exploration maps, enabling seamless user takeover, while the system pauses on detected privacy or high-risk screens for intervention. The authors report that experiments and user studies across 35 tasks over 24 apps demonstrate improvements in coverage, transparency, and reduced workload, and they release the code publicly.

Significance. If the empirical claims hold after detailed reporting, the work could advance HCI research on mobile agents by demonstrating a practical balance between automation and user steerability in cross-app information seeking. The public code release is a clear strength that supports reproducibility. The approach addresses real workflow fragmentation but its impact depends on substantiating the reliability of the multi-LLM components across variable UIs.

major comments (2)

[Abstract and Evaluation] Abstract and Evaluation section: The claim that 'experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload' across 35 tasks over 24 apps supplies no quantitative metrics, baselines, task selection criteria, statistical tests, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed and the central empirical contribution remains unevaluable.
[System Architecture] System Architecture section: The multi-LLM pipeline is described as decomposing queries, navigating pages, and synthesizing reports, yet no mechanisms are specified for detecting or recovering from navigation dead-ends, OCR errors on dynamic screens, or privacy-screen misclassifications. Given that mobile UIs vary rapidly across apps, this omission is load-bearing for the claims of autonomous coverage improvements and reduced workload.

minor comments (2)

[System Design] The progress dashboard description would benefit from additional detail on how exploration maps are rendered in real time and how user interventions are logged for later analysis.
[Discussion] Consider adding a limitations subsection that explicitly discusses failure rates observed during the 35-task evaluation and the frequency of required user takeovers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our empirical results and system robustness. We address each major point below and indicate planned revisions to the manuscript.

read point-by-point responses

Referee: [Abstract and Evaluation] Abstract and Evaluation section: The claim that 'experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload' across 35 tasks over 24 apps supplies no quantitative metrics, baselines, task selection criteria, statistical tests, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed and the central empirical contribution remains unevaluable.

Authors: We agree that the current abstract and high-level summary in the Evaluation section lack the requested quantitative detail. In the revised manuscript we will expand the Evaluation section to report specific metrics including coverage rates (percentage of information items successfully retrieved), success rates across the 35 tasks, workload reduction via NASA-TLX scores, baseline comparisons against manual search and a prior mobile agent, explicit task selection criteria (tasks sampled from productivity, research, and shopping scenarios across the 24 apps), paired statistical tests with p-values, and a categorized error analysis of failure cases. These additions will make the magnitude and reliability of the gains directly assessable. revision: yes
Referee: [System Architecture] System Architecture section: The multi-LLM pipeline is described as decomposing queries, navigating pages, and synthesizing reports, yet no mechanisms are specified for detecting or recovering from navigation dead-ends, OCR errors on dynamic screens, or privacy-screen misclassifications. Given that mobile UIs vary rapidly across apps, this omission is load-bearing for the claims of autonomous coverage improvements and reduced workload.

Authors: We acknowledge that the nominal pipeline description omits explicit error-handling mechanisms. In the revision we will add a dedicated subsection detailing: (1) navigation dead-end detection via LLM-based page-state verification followed by backtracking or alternative path selection; (2) OCR error mitigation through multi-frame screenshot capture and cross-verification with the LLM; and (3) privacy-screen classification using an ensemble of vision-language models with confidence thresholding and fallback to user prompts. These mechanisms directly support the reliability claims for autonomous coverage and workload reduction in variable mobile UIs. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering system description with no derivations or fitted predictions

full rationale

The paper presents DroidRetriever as a multi-LLM mobile automation system with a progress dashboard, evaluated on 35 tasks across 24 apps for coverage, transparency, and workload. No equations, parameters, or predictions appear in the abstract or described architecture. The contribution is an implemented system plus user studies rather than a derivation chain; nothing reduces by construction to prior fitted values or self-citations. This is the expected non-finding for a systems paper whose claims rest on empirical demonstration rather than closed-form reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The system rests on domain assumptions about LLM reliability for task decomposition and mobile navigation plus user willingness to intervene via the dashboard; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption Multi-LLM agents can accurately decompose information-seeking tasks and navigate mobile app interfaces across diverse apps
Invoked in the description of task decomposition, navigation, and report synthesis.

invented entities (1)

Progress dashboard combining sub-task progress and real-time exploration maps no independent evidence
purpose: Provide transparency and enable seamless user takeover
New interface component introduced to address opacity of existing agents

pith-pipeline@v0.9.0 · 5714 in / 1395 out tokens · 53116 ms · 2026-05-22T16:57:56.653766+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 6 internal anchors

[1]

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang

work page
[2]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents. arXiv:2504.00906 [cs.AI] https://arxiv.org/abs/2504.00906

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, and Bo Zheng. 2025. InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning. arXiv:2508.19679 [cs.AI] https://arxiv.org/abs/ 2508.19679

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Bennett, Kori Inkpen, Jaime Tee- van, Ruth Kikin-Gil, and Eric Horvitz

Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human- AI Interaction. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems(Glasgow, Scotland Uk)(CHI ’19). Associa...

work page doi:10.1145/3290605.3300233 2019
[5]

anthropic. 2024. Build with Claude - Computer use (beta). https://docs.anthropic. com/en/docs/build-with-claude/computer-use

work page 2024
[6]

Jaime Arguello and Rob Capra. 2016. The Effects of Aggregated Search Coherence on Search Behavior.ACM Trans. Inf. Syst.35, 1, Article 2 (Sept. 2016), 30 pages. doi:10.1145/2935747

work page doi:10.1145/2935747 2016
[7]

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, and Abhanshu Sharma

work page
[8]

InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Kate Larson (Ed.)

ScreenAI: A Vision-Language Model for UI and Infographics Understand- ing. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Kate Larson (Ed.). International Joint Conferences on Ar- tificial Intelligence Organization, 3058–3068. doi:10.24963/ijcai.2024/339 Main Track

work page doi:10.24963/ijcai.2024/339 2024
[9]

Sunandan Chakraborty, Zohaib Jabbar, and Lakshminarayanan Subramanian

work page
[10]

InProceedings of the 2015 Annual Symposium on Computing for Development (London, United Kingdom)(DEV ’15)

Summarization Search: A New Search Abstraction for Mobile Devices. InProceedings of the 2015 Annual Symposium on Computing for Development (London, United Kingdom)(DEV ’15). Association for Computing Machinery, New York, NY, USA, 69–70. doi:10.1145/2830629.2835217

work page doi:10.1145/2830629.2835217 2015
[11]

Joseph Chee Chang, Nathan Hahn, and Aniket Kittur. 2016. Supporting Mobile Sensemaking Through Intentionally Uncertain Highlighting. InProceedings of the 29th Annual Symposium on User Interface Software and Technology(Tokyo, Japan)(UIST ’16). Association for Computing Machinery, New York, NY, USA, 61–68. doi:10.1145/2984511.2984538

work page doi:10.1145/2984511.2984538 2016
[12]

Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. InProceedings of the 30th Annual ACM Symposium on User Interface Software and Technology(Québec City, QC, Canada)(UIST ’17). Association for Computing Machin...

work page doi:10.1145/3126594.3126651 2017
[13]

Weiwei Gao, Kexin Du, Yujia Luo, Weinan Shi, Chun Yu, and Yuanchun Shi. 2024. EasyAsk: An In-App Contextual Tutorial Search Assistant for Older Adults with Voice and Touch Inputs.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies8, 3 (2024), 1–27

work page 2024
[14]

Genspark. 2024. Welcome to Genspark, the AI Agent Engine. https://mainfunc. ai/blog/genspark_intro

work page 2024
[15]

Aakar Gupta, Muhammed Anwar, and Ravin Balakrishnan. 2016. Porous Inter- faces for Small Screen Multitasking using Finger Identification. InProceedings of the 29th Annual Symposium on User Interface Software and Technology(Tokyo, Japan)(UIST ’16). Association for Computing Machinery, New York, NY, USA, 145–156. doi:10.1145/2984511.2984557

work page doi:10.1145/2984511.2984557 2016
[16]

Nathan Hahn, Joseph Chee Chang, and Aniket Kittur. 2018. Bento Browser: Complex Mobile Search Without Tabs. InProceedings of the 2018 CHI Confer- ence on Human Factors in Computing Systems(Montreal QC, Canada)(CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/ 3173574.3173825

work page arXiv 2018
[17]

Nina Hollender, Cristian Hofmann, Michael Deneke, and Bernhard Schmitz. 2010. Integrating cognitive load theory and concepts of human–computer interaction. DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking CHI ’26, April 13–17, 2026, Barcelona, Spain Computers in Human Behavior26, 6 (2010), 1278–128...

work page doi:10.1016/j.chb.2010.05 2010
[18]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. 2024. CogA- gent: A Visual Language Model for GUI Agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14281–14290

work page 2024
[19]

You Can Find a Part of my Life in Every Single App

Kasper Hornbæk, Ulrik Lyngs, Olga Iarygina, and Mikael B. Skov. 2024. “You Can Find a Part of my Life in Every Single App”: An Interview Study of What Makes Smartphone Applications Special to Their Users. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New Yo...

work page doi:10.1145/3613904.3642820 2024
[20]

Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Pittsburgh, Pennsylvania, USA)(CHI ’99). Association for Computing Machinery, New York, NY, USA, 159–166. doi:10.1145/302979.303030

work page doi:10.1145/302979.303030 1999
[21]

2024.An Open Source Evaluation for Search APIs

Mehul Chadda Ishaan, Akhilesh Sharma. 2024.An Open Source Evaluation for Search APIs. https://github.com/lumina-ai-inc/benchmark

work page 2024
[22]

ItzCrazyKns. 2025. Perplexica: A privacy-focused AI answering engine. GitHub repository. https://github.com/ItzCrazyKns/Perplexica Version 1.9.1

work page 2025
[23]

2023.YOLO by Ultralytics

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023.YOLO by Ultralytics. https: //github.com/ultralytics/ultralytics

work page 2023
[24]

Prerna Juneja, Wenjuan Zhang, Alison Marie Smith-Renner, Hemank Lamba, Joel Tetreault, and Alex Jaimes. 2024. Dissecting users’ needs for search result explanations. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 841, 17 pages. doi:...

work page doi:10.1145/3613904 2024
[25]

Beata Jungselius and Alexandra Weilenmann. 2025. Tracing Change in Social Media Use: A Qualitative Longitudinal Study. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 957, 14 pages. doi:10.1145/ 3706598.3713813

work page arXiv 2025
[26]

Kai. 2024. Unleash AI Search Power with Devv.AI: A Developer’s Guide. https: //devv.ai/blog/post/devvai-devs-search-guide

work page 2024
[27]

Karlson, Shamsi T

Amy K. Karlson, Shamsi T. Iqbal, Brian Meyers, Gonzalo Ramos, Kathy Lee, and John C. Tang. 2010. Mobile taskflow in context: a screenshot study of smartphone usage. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Atlanta, Georgia, USA)(CHI ’10). Association for Computing Machinery, New York, NY, USA, 2009–2018. doi:10.1145/175...

work page doi:10.1145/1753326.1753631 2010
[28]

Karlson, George G

Amy K. Karlson, George G. Robertson, Daniel C. Robbins, Mary P. Czerwinski, and Greg R. Smith. 2006. FaThumb: a facet-based interface for mobile search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Montréal, Québec, Canada)(CHI ’06). Association for Computing Machinery, New York, NY, USA, 711–720. doi:10.1145/1124772.1124878

work page doi:10.1145/1124772.1124878 2006
[29]

Things on the Ground are Different

Lindah Kotut and Hummd Alikhan. 2024. "Things on the Ground are Different": Utility, Survival and Ethics in Multi-Device Ownership and Smartphone Sharing Contexts. InProceedings of the 2024 CHI Conference on Human Factors in Comput- ing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 760, 14 pages. doi:...

work page doi:10.1145/3613904.3642874 2024
[30]

Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. InProceedings of the 20th International Conference on Intelligent User Interfaces (Atlanta, Georgia, USA)(IUI ’15). Association for Computing Machinery, New York, NY, USA, 126–137. doi:10.1145/2678025.2701399

work page doi:10.1145/2678025.2701399 2015
[31]

Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. 2014. Towards better measurement of attention and satisfaction in mobile search. InProceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. 113–122

work page 2014
[32]

Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, and Insik Shin. 2025. Safeguarding Mobile GUI Agent via Logic-based Action Verification. arXiv:2503.18492 [cs.HC] https: //arxiv.org/abs/2503.18492

work page arXiv 2025
[33]

Lee and Katrina A

John D. Lee and Katrina A. See. 2004. Trust in Automation: Designing for Appropriate Reliance.Human Factors46, 1 (2004), 50–80. doi:10.1518/hfes.46.1. 50_30392 PMID: 15151155

work page doi:10.1518/hfes.46.1 2004
[34]

Explore, select, derive, and recall: Augmenting llm with human-like memory for mobile task automation.arXiv preprint arXiv:2312.03003, 2023

Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2024. Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation. arXiv:2312.03003 [cs.HC] https://arxiv.org/abs/2312.03003

work page arXiv 2024
[35]

Linlin Li, Ruifeng Wang, Xian Zhan, Ying Wang, Cuiyun Gao, Sinan Wang, and Yepang Liu. 2023. What You See Is What You Get? It Is Not the Case! Detecting Misleading Icons for Mobile Applications. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 538–550

work page 2023
[36]

Toby Jia-Jun Li, Jingya Chen, Brandon Canfield, and Brad A. Myers. 2020. Privacy- Preserving Script Sharing in GUI-based Programming-by-Demonstration Sys- tems.Proc. ACM Hum.-Comput. Interact.4, CSCW1, Article 60 (May 2020), 23 pages. doi:10.1145/3392869

work page doi:10.1145/3392869 2020
[37]

Mitchell, and Brad A

Toby Jia-Jun Li, Jingya Chen, Haijun Xia, Tom M. Mitchell, and Brad A. Myers

work page
[38]

InProceedings of the 33rd Annual ACM Symposium on User Interface Soft- ware and Technology(Virtual Event, USA)(UIST ’20)

Multi-Modal Repairs of Conversational Breakdowns in Task-Oriented Dialogs. InProceedings of the 33rd Annual ACM Symposium on User Interface Soft- ware and Technology(Virtual Event, USA)(UIST ’20). Association for Computing Machinery, New York, NY, USA, 1094–1107. doi:10.1145/3379337.3415820

work page doi:10.1145/3379337.3415820
[39]

Shaobo Liang and Ziyi Wei. 2024. Understanding Users’ App-Switching Be- havior During the Mobile Search: An Empirical Study from the Perspective of Push–Pull–Mooring Framework.Behavioral Sciences14, 11 (2024). doi:10.3390/ bs14110989

work page 2024
[40]

Yuwen Lu, Meng Chen, Qi Zhao, Victor Cox, Yang Yang, Meng Jiang, Jay Brock- man, Tamara Kay, and Toby Jia-Jun Li. 2025. Crepe: A Mobile Screen Data Collector Using Graph Query. arXiv:2406.16173 [cs.HC] https://arxiv.org/abs/ 2406.16173

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Jiaxin Mao, Cheng Luo, Min Zhang, and Shaoping Ma. 2018. Constructing Click Models for Mobile Search. InThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval(Ann Arbor, MI, USA)(SIGIR ’18). Association for Computing Machinery, New York, NY, USA, 775–784. doi:10. 1145/3209978.3210060

work page arXiv 2018
[42]

2024.Morphic

Yoshiki Miura. 2024.Morphic. https://github.com/miurla/morphic

work page 2024
[43]

2023.ChatGPT Retrieval Plugin

OpenAI. 2023.ChatGPT Retrieval Plugin. https://github.com/openai/chatgpt- retrieval-plugin

work page 2023
[44]

OpenAI. 2024. SearchGPT Prototype. https://openai.com/index/searchgpt- prototype/

work page 2024
[45]

OpenAI. 2025. Introducing Operator — an AI agent that can use a computer for you. https://openai.com/index/introducing-operator/

work page 2025
[46]

Bigham, and Amy Pavel

Yi-Hao Peng, Dingzeyu Li, Jeffrey P. Bigham, and Amy Pavel. 2025. Morae: Proactively Pausing UI Agents for User Choices. arXiv:2508.21456 [cs.HC] https: //arxiv.org/abs/2508.21456

work page arXiv 2025
[47]

Peter Pirolli and Stuart Card. 1995. Information foraging in information access environments. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Denver, Colorado, USA)(CHI ’95). ACM Press/Addison- Wesley Publishing Co., USA, 51–58. doi:10.1145/223904.223911

work page doi:10.1145/223904.223911 1995
[48]

Peter Pirolli and Stuart Card. 2005. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. InConference: Proceedings of International Conference on Intelligence Analysis

work page 2005
[49]

Laura Pope Robbins, Lisa Esposito, Chris Kretz, and Michael Aloi. 2007. What a user wants: Redesigning a library’s web site based on a card-sort analysis.Journal of Web Librarianship1, 4 (2007), 3–27

work page 2007
[50]

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weim- ing Lu, Dongsheng Li, and Yueting Zhuang. 2024. TaskBench: Benchmarking Large Language Models for Task Automation. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associa...

work page doi:10.52202/079017-0148 2024
[51]

Jaspreet Singh and Avishek Anand. 2019. EXS: Explainable Search Using Local Model Agnostic Interpretability. InProceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, New York, NY, USA, 770–773

work page 2019
[52]

Chaparro

Christina Siu and Barbara S. Chaparro. 2014. First Look: Examining the Horizontal Grid Layout using Eye-tracking.Proceedings of the Human Factors and Ergonomics Society Annual Meeting58, 1 (2014), 1119–1123. arXiv:https://doi.org/10.1177/1541931214581234 doi:10.1177/1541931214581234

work page doi:10.1177/1541931214581234 2014
[53]

Yunpeng Song, Yiheng Bian, Yongtao Tang, Guiyu Ma, and Zhongmin Cai. 2024. VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology(Pittsburgh, PA, USA)(UIST ’24). Association for Computing Machinery, New York, NY, USA, Article 4...

work page arXiv 2024
[54]

Minh Duc Vu, Han Wang, Jieshan Chen, Zhuang Li, Shengdong Zhao, Zhenchang Xing, and Chunyang Chen. 2024. GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning. InProceed- ings of the 37th Annual ACM Symposium on User Interface Software and Technology (Pittsburgh, PA, USA)(UIST ’24). Association for ...

work page doi:10.1145/3654777.3676356 2024
[55]

Wanderboat. [n. d.]. Your everyday Al companion for getaway ideas. https: //wanderboat.ai/about

work page
[56]

Bryan Wang, Gang Li, and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI using Large Language Models. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 432, 17 pages. doi:10.1145/3544548.3580895

work page doi:10.1145/3544548.3580895 2023
[57]

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning. InThe 34th Annual ACM Symposium on User Interface Software and Technology(Virtual Event, USA)(UIST ’21). Association for Computing Machinery, New York, NY, USA, 498–510. https://doi.org/10.1145/3472749.3474765

work page doi:10.1145/3472749.3474765 2021
[58]

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. CHI ’26, April 13–17, 2026, Barcelona, Spain Bian et al. arXiv preprint arXiv:2406.01014(2024)

work page arXiv 2024
[59]

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. Autodroid: Llm-powered task automation in android. InProceedings of the 30th Annual International Con- ference on Mobile Computing and Networking. 543–557

work page 2024
[61]

Magdalena Wischnewski, Nicole Krämer, and Emmanuel Müller. 2023. Measur- ing and Understanding Trust Calibrations for Automated Systems: A Survey of the State-Of-The-Art and Future Directions. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, A...

work page doi:10.1145/3544548.3581197 2023
[62]

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562(2023)

work page arXiv 2023
[63]

2024.Search with Lepton

Yadong Xie Yangqing Jia. 2024.Search with Lepton. https://github.com/leptonai/ search_with_lepton

work page 2024
[64]

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan. 2025. Mobile-Agent-v3: Fundamental Agents for GUI Automation. arXiv:2508.15144 [cs.AI] https://arxiv.org/abs/2508.15144

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs.arXiv preprint arXiv:2404.05719(2024)

work page arXiv 2024
[66]

Chen-Hsiang Yu. 2012. Mobile continuous reading. InCHI ’12 Extended Abstracts on Human Factors in Computing Systems(Austin, Texas, USA)(CHI EA ’12). Association for Computing Machinery, New York, NY, USA, 1405–1410. doi:10. 1145/2212776.2212463

work page arXiv 2012
[67]

Ja Eun Yu and Debaleena Chattopadhyay. 2024. Reducing the Search Space on demand helps Older Adults find Mobile UI Features quickly, on par with Younger Adults. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–22

work page 2024
[68]

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. arXiv:2312.13771 [cs.CV] https://arxiv.org/abs/2312.13771

work page internal anchor Pith review Pith/arXiv arXiv 2023
[69]

2024.SenseVoice

Shi Xian zhifu gao, Lizerui9926. 2024.SenseVoice. https://github.com/ FunAudioLLM/SenseVoice

work page 2024
[70]

scrolling screenshot

Xiyue Zhu, Peng Tang, Haofu Liao, and Srikar Appalaraju. 2025. Turbocharging Web Automation: The Impact of Compressed History States. InFindings of the As- sociation for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Compu- tational Linguistics, Vienna, Austria, 3644...

work page doi:10.18653/v1/2025.findings- 2025

[1] [1]

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang

work page

[2] [2]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents. arXiv:2504.00906 [cs.AI] https://arxiv.org/abs/2504.00906

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, and Bo Zheng. 2025. InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning. arXiv:2508.19679 [cs.AI] https://arxiv.org/abs/ 2508.19679

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Bennett, Kori Inkpen, Jaime Tee- van, Ruth Kikin-Gil, and Eric Horvitz

Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human- AI Interaction. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems(Glasgow, Scotland Uk)(CHI ’19). Associa...

work page doi:10.1145/3290605.3300233 2019

[5] [5]

anthropic. 2024. Build with Claude - Computer use (beta). https://docs.anthropic. com/en/docs/build-with-claude/computer-use

work page 2024

[6] [6]

Jaime Arguello and Rob Capra. 2016. The Effects of Aggregated Search Coherence on Search Behavior.ACM Trans. Inf. Syst.35, 1, Article 2 (Sept. 2016), 30 pages. doi:10.1145/2935747

work page doi:10.1145/2935747 2016

[7] [7]

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, and Abhanshu Sharma

work page

[8] [8]

InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Kate Larson (Ed.)

ScreenAI: A Vision-Language Model for UI and Infographics Understand- ing. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Kate Larson (Ed.). International Joint Conferences on Ar- tificial Intelligence Organization, 3058–3068. doi:10.24963/ijcai.2024/339 Main Track

work page doi:10.24963/ijcai.2024/339 2024

[9] [9]

Sunandan Chakraborty, Zohaib Jabbar, and Lakshminarayanan Subramanian

work page

[10] [10]

InProceedings of the 2015 Annual Symposium on Computing for Development (London, United Kingdom)(DEV ’15)

Summarization Search: A New Search Abstraction for Mobile Devices. InProceedings of the 2015 Annual Symposium on Computing for Development (London, United Kingdom)(DEV ’15). Association for Computing Machinery, New York, NY, USA, 69–70. doi:10.1145/2830629.2835217

work page doi:10.1145/2830629.2835217 2015

[11] [11]

Joseph Chee Chang, Nathan Hahn, and Aniket Kittur. 2016. Supporting Mobile Sensemaking Through Intentionally Uncertain Highlighting. InProceedings of the 29th Annual Symposium on User Interface Software and Technology(Tokyo, Japan)(UIST ’16). Association for Computing Machinery, New York, NY, USA, 61–68. doi:10.1145/2984511.2984538

work page doi:10.1145/2984511.2984538 2016

[12] [12]

Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. InProceedings of the 30th Annual ACM Symposium on User Interface Software and Technology(Québec City, QC, Canada)(UIST ’17). Association for Computing Machin...

work page doi:10.1145/3126594.3126651 2017

[13] [13]

Weiwei Gao, Kexin Du, Yujia Luo, Weinan Shi, Chun Yu, and Yuanchun Shi. 2024. EasyAsk: An In-App Contextual Tutorial Search Assistant for Older Adults with Voice and Touch Inputs.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies8, 3 (2024), 1–27

work page 2024

[14] [14]

Genspark. 2024. Welcome to Genspark, the AI Agent Engine. https://mainfunc. ai/blog/genspark_intro

work page 2024

[15] [15]

Aakar Gupta, Muhammed Anwar, and Ravin Balakrishnan. 2016. Porous Inter- faces for Small Screen Multitasking using Finger Identification. InProceedings of the 29th Annual Symposium on User Interface Software and Technology(Tokyo, Japan)(UIST ’16). Association for Computing Machinery, New York, NY, USA, 145–156. doi:10.1145/2984511.2984557

work page doi:10.1145/2984511.2984557 2016

[16] [16]

Nathan Hahn, Joseph Chee Chang, and Aniket Kittur. 2018. Bento Browser: Complex Mobile Search Without Tabs. InProceedings of the 2018 CHI Confer- ence on Human Factors in Computing Systems(Montreal QC, Canada)(CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/ 3173574.3173825

work page arXiv 2018

[17] [17]

Nina Hollender, Cristian Hofmann, Michael Deneke, and Bernhard Schmitz. 2010. Integrating cognitive load theory and concepts of human–computer interaction. DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking CHI ’26, April 13–17, 2026, Barcelona, Spain Computers in Human Behavior26, 6 (2010), 1278–128...

work page doi:10.1016/j.chb.2010.05 2010

[18] [18]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. 2024. CogA- gent: A Visual Language Model for GUI Agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14281–14290

work page 2024

[19] [19]

You Can Find a Part of my Life in Every Single App

Kasper Hornbæk, Ulrik Lyngs, Olga Iarygina, and Mikael B. Skov. 2024. “You Can Find a Part of my Life in Every Single App”: An Interview Study of What Makes Smartphone Applications Special to Their Users. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New Yo...

work page doi:10.1145/3613904.3642820 2024

[20] [20]

Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Pittsburgh, Pennsylvania, USA)(CHI ’99). Association for Computing Machinery, New York, NY, USA, 159–166. doi:10.1145/302979.303030

work page doi:10.1145/302979.303030 1999

[21] [21]

2024.An Open Source Evaluation for Search APIs

Mehul Chadda Ishaan, Akhilesh Sharma. 2024.An Open Source Evaluation for Search APIs. https://github.com/lumina-ai-inc/benchmark

work page 2024

[22] [22]

ItzCrazyKns. 2025. Perplexica: A privacy-focused AI answering engine. GitHub repository. https://github.com/ItzCrazyKns/Perplexica Version 1.9.1

work page 2025

[23] [23]

2023.YOLO by Ultralytics

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023.YOLO by Ultralytics. https: //github.com/ultralytics/ultralytics

work page 2023

[24] [24]

Prerna Juneja, Wenjuan Zhang, Alison Marie Smith-Renner, Hemank Lamba, Joel Tetreault, and Alex Jaimes. 2024. Dissecting users’ needs for search result explanations. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 841, 17 pages. doi:...

work page doi:10.1145/3613904 2024

[25] [25]

Beata Jungselius and Alexandra Weilenmann. 2025. Tracing Change in Social Media Use: A Qualitative Longitudinal Study. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 957, 14 pages. doi:10.1145/ 3706598.3713813

work page arXiv 2025

[26] [26]

Kai. 2024. Unleash AI Search Power with Devv.AI: A Developer’s Guide. https: //devv.ai/blog/post/devvai-devs-search-guide

work page 2024

[27] [27]

Karlson, Shamsi T

Amy K. Karlson, Shamsi T. Iqbal, Brian Meyers, Gonzalo Ramos, Kathy Lee, and John C. Tang. 2010. Mobile taskflow in context: a screenshot study of smartphone usage. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Atlanta, Georgia, USA)(CHI ’10). Association for Computing Machinery, New York, NY, USA, 2009–2018. doi:10.1145/175...

work page doi:10.1145/1753326.1753631 2010

[28] [28]

Karlson, George G

Amy K. Karlson, George G. Robertson, Daniel C. Robbins, Mary P. Czerwinski, and Greg R. Smith. 2006. FaThumb: a facet-based interface for mobile search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Montréal, Québec, Canada)(CHI ’06). Association for Computing Machinery, New York, NY, USA, 711–720. doi:10.1145/1124772.1124878

work page doi:10.1145/1124772.1124878 2006

[29] [29]

Things on the Ground are Different

Lindah Kotut and Hummd Alikhan. 2024. "Things on the Ground are Different": Utility, Survival and Ethics in Multi-Device Ownership and Smartphone Sharing Contexts. InProceedings of the 2024 CHI Conference on Human Factors in Comput- ing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 760, 14 pages. doi:...

work page doi:10.1145/3613904.3642874 2024

[30] [30]

Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. InProceedings of the 20th International Conference on Intelligent User Interfaces (Atlanta, Georgia, USA)(IUI ’15). Association for Computing Machinery, New York, NY, USA, 126–137. doi:10.1145/2678025.2701399

work page doi:10.1145/2678025.2701399 2015

[31] [31]

Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. 2014. Towards better measurement of attention and satisfaction in mobile search. InProceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. 113–122

work page 2014

[32] [32]

Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, and Insik Shin. 2025. Safeguarding Mobile GUI Agent via Logic-based Action Verification. arXiv:2503.18492 [cs.HC] https: //arxiv.org/abs/2503.18492

work page arXiv 2025

[33] [33]

Lee and Katrina A

John D. Lee and Katrina A. See. 2004. Trust in Automation: Designing for Appropriate Reliance.Human Factors46, 1 (2004), 50–80. doi:10.1518/hfes.46.1. 50_30392 PMID: 15151155

work page doi:10.1518/hfes.46.1 2004

[34] [34]

Explore, select, derive, and recall: Augmenting llm with human-like memory for mobile task automation.arXiv preprint arXiv:2312.03003, 2023

Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2024. Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation. arXiv:2312.03003 [cs.HC] https://arxiv.org/abs/2312.03003

work page arXiv 2024

[35] [35]

Linlin Li, Ruifeng Wang, Xian Zhan, Ying Wang, Cuiyun Gao, Sinan Wang, and Yepang Liu. 2023. What You See Is What You Get? It Is Not the Case! Detecting Misleading Icons for Mobile Applications. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 538–550

work page 2023

[36] [36]

Toby Jia-Jun Li, Jingya Chen, Brandon Canfield, and Brad A. Myers. 2020. Privacy- Preserving Script Sharing in GUI-based Programming-by-Demonstration Sys- tems.Proc. ACM Hum.-Comput. Interact.4, CSCW1, Article 60 (May 2020), 23 pages. doi:10.1145/3392869

work page doi:10.1145/3392869 2020

[37] [37]

Mitchell, and Brad A

Toby Jia-Jun Li, Jingya Chen, Haijun Xia, Tom M. Mitchell, and Brad A. Myers

work page

[38] [38]

InProceedings of the 33rd Annual ACM Symposium on User Interface Soft- ware and Technology(Virtual Event, USA)(UIST ’20)

Multi-Modal Repairs of Conversational Breakdowns in Task-Oriented Dialogs. InProceedings of the 33rd Annual ACM Symposium on User Interface Soft- ware and Technology(Virtual Event, USA)(UIST ’20). Association for Computing Machinery, New York, NY, USA, 1094–1107. doi:10.1145/3379337.3415820

work page doi:10.1145/3379337.3415820

[39] [39]

Shaobo Liang and Ziyi Wei. 2024. Understanding Users’ App-Switching Be- havior During the Mobile Search: An Empirical Study from the Perspective of Push–Pull–Mooring Framework.Behavioral Sciences14, 11 (2024). doi:10.3390/ bs14110989

work page 2024

[40] [40]

Yuwen Lu, Meng Chen, Qi Zhao, Victor Cox, Yang Yang, Meng Jiang, Jay Brock- man, Tamara Kay, and Toby Jia-Jun Li. 2025. Crepe: A Mobile Screen Data Collector Using Graph Query. arXiv:2406.16173 [cs.HC] https://arxiv.org/abs/ 2406.16173

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Jiaxin Mao, Cheng Luo, Min Zhang, and Shaoping Ma. 2018. Constructing Click Models for Mobile Search. InThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval(Ann Arbor, MI, USA)(SIGIR ’18). Association for Computing Machinery, New York, NY, USA, 775–784. doi:10. 1145/3209978.3210060

work page arXiv 2018

[42] [42]

2024.Morphic

Yoshiki Miura. 2024.Morphic. https://github.com/miurla/morphic

work page 2024

[43] [43]

2023.ChatGPT Retrieval Plugin

OpenAI. 2023.ChatGPT Retrieval Plugin. https://github.com/openai/chatgpt- retrieval-plugin

work page 2023

[44] [44]

OpenAI. 2024. SearchGPT Prototype. https://openai.com/index/searchgpt- prototype/

work page 2024

[45] [45]

OpenAI. 2025. Introducing Operator — an AI agent that can use a computer for you. https://openai.com/index/introducing-operator/

work page 2025

[46] [46]

Bigham, and Amy Pavel

Yi-Hao Peng, Dingzeyu Li, Jeffrey P. Bigham, and Amy Pavel. 2025. Morae: Proactively Pausing UI Agents for User Choices. arXiv:2508.21456 [cs.HC] https: //arxiv.org/abs/2508.21456

work page arXiv 2025

[47] [47]

Peter Pirolli and Stuart Card. 1995. Information foraging in information access environments. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Denver, Colorado, USA)(CHI ’95). ACM Press/Addison- Wesley Publishing Co., USA, 51–58. doi:10.1145/223904.223911

work page doi:10.1145/223904.223911 1995

[48] [48]

Peter Pirolli and Stuart Card. 2005. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. InConference: Proceedings of International Conference on Intelligence Analysis

work page 2005

[49] [49]

Laura Pope Robbins, Lisa Esposito, Chris Kretz, and Michael Aloi. 2007. What a user wants: Redesigning a library’s web site based on a card-sort analysis.Journal of Web Librarianship1, 4 (2007), 3–27

work page 2007

[50] [50]

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weim- ing Lu, Dongsheng Li, and Yueting Zhuang. 2024. TaskBench: Benchmarking Large Language Models for Task Automation. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associa...

work page doi:10.52202/079017-0148 2024

[51] [51]

Jaspreet Singh and Avishek Anand. 2019. EXS: Explainable Search Using Local Model Agnostic Interpretability. InProceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, New York, NY, USA, 770–773

work page 2019

[52] [52]

Chaparro

Christina Siu and Barbara S. Chaparro. 2014. First Look: Examining the Horizontal Grid Layout using Eye-tracking.Proceedings of the Human Factors and Ergonomics Society Annual Meeting58, 1 (2014), 1119–1123. arXiv:https://doi.org/10.1177/1541931214581234 doi:10.1177/1541931214581234

work page doi:10.1177/1541931214581234 2014

[53] [53]

Yunpeng Song, Yiheng Bian, Yongtao Tang, Guiyu Ma, and Zhongmin Cai. 2024. VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology(Pittsburgh, PA, USA)(UIST ’24). Association for Computing Machinery, New York, NY, USA, Article 4...

work page arXiv 2024

[54] [54]

Minh Duc Vu, Han Wang, Jieshan Chen, Zhuang Li, Shengdong Zhao, Zhenchang Xing, and Chunyang Chen. 2024. GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning. InProceed- ings of the 37th Annual ACM Symposium on User Interface Software and Technology (Pittsburgh, PA, USA)(UIST ’24). Association for ...

work page doi:10.1145/3654777.3676356 2024

[55] [55]

Wanderboat. [n. d.]. Your everyday Al companion for getaway ideas. https: //wanderboat.ai/about

work page

[56] [56]

Bryan Wang, Gang Li, and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI using Large Language Models. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 432, 17 pages. doi:10.1145/3544548.3580895

work page doi:10.1145/3544548.3580895 2023

[57] [57]

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning. InThe 34th Annual ACM Symposium on User Interface Software and Technology(Virtual Event, USA)(UIST ’21). Association for Computing Machinery, New York, NY, USA, 498–510. https://doi.org/10.1145/3472749.3474765

work page doi:10.1145/3472749.3474765 2021

[58] [58]

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. CHI ’26, April 13–17, 2026, Barcelona, Spain Bian et al. arXiv preprint arXiv:2406.01014(2024)

work page arXiv 2024

[59] [59]

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. Autodroid: Llm-powered task automation in android. InProceedings of the 30th Annual International Con- ference on Mobile Computing and Networking. 543–557

work page 2024

[61] [61]

Magdalena Wischnewski, Nicole Krämer, and Emmanuel Müller. 2023. Measur- ing and Understanding Trust Calibrations for Automated Systems: A Survey of the State-Of-The-Art and Future Directions. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, A...

work page doi:10.1145/3544548.3581197 2023

[62] [62]

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562(2023)

work page arXiv 2023

[63] [63]

2024.Search with Lepton

Yadong Xie Yangqing Jia. 2024.Search with Lepton. https://github.com/leptonai/ search_with_lepton

work page 2024

[64] [64]

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan. 2025. Mobile-Agent-v3: Fundamental Agents for GUI Automation. arXiv:2508.15144 [cs.AI] https://arxiv.org/abs/2508.15144

work page internal anchor Pith review Pith/arXiv arXiv 2025

[65] [65]

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs.arXiv preprint arXiv:2404.05719(2024)

work page arXiv 2024

[66] [66]

Chen-Hsiang Yu. 2012. Mobile continuous reading. InCHI ’12 Extended Abstracts on Human Factors in Computing Systems(Austin, Texas, USA)(CHI EA ’12). Association for Computing Machinery, New York, NY, USA, 1405–1410. doi:10. 1145/2212776.2212463

work page arXiv 2012

[67] [67]

Ja Eun Yu and Debaleena Chattopadhyay. 2024. Reducing the Search Space on demand helps Older Adults find Mobile UI Features quickly, on par with Younger Adults. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–22

work page 2024

[68] [68]

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. arXiv:2312.13771 [cs.CV] https://arxiv.org/abs/2312.13771

work page internal anchor Pith review Pith/arXiv arXiv 2023

[69] [69]

2024.SenseVoice

Shi Xian zhifu gao, Lizerui9926. 2024.SenseVoice. https://github.com/ FunAudioLLM/SenseVoice

work page 2024

[70] [70]

scrolling screenshot

Xiyue Zhu, Peng Tang, Haofu Liao, and Srikar Appalaraju. 2025. Turbocharging Web Automation: The Impact of Compressed History States. InFindings of the As- sociation for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Compu- tational Linguistics, Vienna, Austria, 3644...

work page doi:10.18653/v1/2025.findings- 2025