DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking
Pith reviewed 2026-05-22 16:57 UTC · model grok-4.3
The pith
DroidRetriever uses a multi-LLM pipeline and live dashboard to let users monitor, steer, and intervene in cross-app mobile searches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DroidRetriever accepts voice or typed input and employs a multi-LLM system to decompose tasks, navigate target pages, take screenshots, and synthesize concise reports with citation-linked screenshots; transparency is achieved through a progress dashboard that combines sub-task status with real-time exploration maps, allowing seamless user takeover, while the system pauses on detected privacy or high-risk screens to prompt intervention.
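The control flow in this claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `Screen`, `RunLog`, `run_task`, and the `navigate` and `ask_user` callbacks are hypothetical names, and the `risky` flag stands in for whatever privacy or high-risk detector the system actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Screen:
    app: str
    page: str
    risky: bool = False  # assumption: set by some privacy/high-risk detector


@dataclass
class RunLog:
    screenshots: List[str] = field(default_factory=list)
    paused_on: List[str] = field(default_factory=list)


def run_task(subtasks: List[str],
             navigate: Callable[[str], Screen],
             ask_user: Callable[[Screen], bool]) -> RunLog:
    """Visit one screen per sub-task; pause and ask the user on risky screens."""
    log = RunLog()
    for sub in subtasks:
        screen = navigate(sub)
        if screen.risky:
            log.paused_on.append(f"{screen.app}/{screen.page}")
            if not ask_user(screen):
                continue  # user declined: do not capture this screen
        log.screenshots.append(f"{screen.app}/{screen.page}.png")
    return log
```

In this sketch a declined prompt skips only the risky screen, so the remaining sub-tasks still run; a real system might instead hand control to the user entirely.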
What carries the argument
The progress dashboard that merges sub-task progress indicators with real-time exploration maps, backed by the multi-LLM pipeline for task decomposition, navigation, screenshot capture, and report synthesis.
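A dashboard of this kind needs some shared state that holds both views at once. The sketch below is one plausible shape for it, under stated assumptions: the `Status` values, the `Dashboard` class, and the edge-set representation of the exploration map are all hypothetical and are not taken from the paper or its code release.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set, Tuple


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    NEEDS_USER = "needs_user"


@dataclass
class Dashboard:
    # Sub-task name -> current status (the progress view).
    subtasks: Dict[str, Status] = field(default_factory=dict)
    # Directed screen-to-screen transitions seen so far (the exploration map).
    edges: Set[Tuple[str, str]] = field(default_factory=set)
    current: str = ""

    def visit(self, screen: str) -> None:
        """Record a transition from the current screen to `screen`."""
        if self.current:
            self.edges.add((self.current, screen))
        self.current = screen

    def progress(self) -> float:
        """Fraction of sub-tasks marked DONE (0.0 when there are none)."""
        if not self.subtasks:
            return 0.0
        done = sum(s is Status.DONE for s in self.subtasks.values())
        return done / len(self.subtasks)
```

Keeping `current` alongside the map is what would let a user take over from the exact screen the agent last reached.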
If this is right
- Final reports include citation-linked screenshots that let users verify each piece of information against its source screen.
- The system pauses automatically before displaying or acting on privacy-sensitive or high-risk content.
- Users avoid repetitive context switching and data re-entry because the automation maintains state across apps.
- Overall coverage rises as the system systematically explores multiple sources while the dashboard keeps the user informed.
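The first point above, verifying each piece of information against its source screen, implies a report structure in which every claim carries screenshot references. A minimal sketch follows, assuming a hypothetical `Claim` type and a dictionary that maps screenshot ids to image paths; the paper does not specify its report format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    text: str
    screenshot_ids: Tuple[str, ...]  # ids of the screens that support the claim


def render_report(claims: List[Claim]) -> str:
    """Render one line per claim, followed by its bracketed citations."""
    lines = []
    for c in claims:
        refs = "".join(f"[{sid}]" for sid in c.screenshot_ids)
        lines.append(f"{c.text} {refs}".rstrip())
    return "\n".join(lines)


def unverifiable(claims: List[Claim], screenshots: Dict[str, str]) -> List[str]:
    """Return the text of claims with no citation that resolves to a screenshot."""
    return [c.text for c in claims
            if not any(sid in screenshots for sid in c.screenshot_ids)]
```

A check like `unverifiable` would let the system, or a reviewer, flag report sentences that no captured screen supports.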
Where Pith is reading between the lines
- The same combination of visible maps and pause points could reduce opacity in automation tools built for desktop or web environments.
- Repeated successful handoffs between agent and user may encourage designers to add explicit takeover features to other personal-data agents.
- Over time the dashboard data could reveal common intervention patterns that inform better default behaviors for future versions.
Load-bearing premise
The multi-LLM system can reliably decompose queries, navigate through diverse apps, and produce accurate reports without frequent errors or getting stuck.
What would settle it
If evaluation on the 35 tasks shows frequent navigation failures, incomplete reports, or no measurable drop in user workload and context switching, the claimed improvements would not hold.
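The check described above could be made concrete roughly as follows. This is a sketch under assumptions: `coverage_rate` treats coverage as the fraction of expected information items that were retrieved, and the exact sign-flip permutation test is one possible paired test chosen here for illustration, not the test the authors report. Any inputs passed to these functions would have to come from the actual study data.

```python
from itertools import product
from typing import Sequence, Set


def coverage_rate(retrieved: Set[str], expected: Set[str]) -> float:
    """Fraction of expected information items that were retrieved."""
    if not expected:
        raise ValueError("expected must be non-empty")
    return len(retrieved & expected) / len(expected)


def paired_sign_flip_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation p-value for paired samples.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total
```

For example, `a` could hold per-participant workload scores under manual search and `b` the scores with the system; a large p-value would mean the observed drop is not distinguishable from chance.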
Original abstract
Information seeking on mobile devices is often fragmented, trapping users in repetitive cycles of context switching and data re-entry, which increases cognitive load and disrupts workflow. Existing mobile agents provide limited cross-source integration and are largely opaque, presenting progress as a linear feed with few opportunities to intervene, steer, or take control. We present DroidRetriever, a transparent, steerable system for cross-source mobile information seeking. It accepts voice or typed input and the multi-LLM system decomposes the task, navigates to target pages, takes screenshots, and synthesizes a concise report with citation-linked screenshots. We make the process transparent through a progress dashboard combining sub-task progress and real-time exploration maps for seamless takeover. DroidRetriever also pauses on detected privacy or high-risk screens and prompts intervention. Across 35 tasks over 24 apps, experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload. We release our code at https://github.com/AkimotoAyako/DroidRetriever.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DroidRetriever, a transparent and steerable automation system for collaborative mobile information seeking. It uses a multi-LLM pipeline to accept voice or typed input, decompose tasks, navigate target pages in mobile apps, capture screenshots, and synthesize concise reports with citation-linked screenshots. Transparency is achieved via a progress dashboard that combines sub-task progress with real-time exploration maps, enabling seamless user takeover, while the system pauses on detected privacy or high-risk screens for intervention. The authors report that experiments and user studies across 35 tasks over 24 apps demonstrate improvements in coverage, transparency, and reduced workload, and they release the code publicly.
Significance. If the empirical claims hold after detailed reporting, the work could advance HCI research on mobile agents by demonstrating a practical balance between automation and user steerability in cross-app information seeking. The public code release is a clear strength that supports reproducibility. The approach addresses real workflow fragmentation, but its impact depends on substantiating the reliability of the multi-LLM components across variable UIs.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: The claim that 'experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload' across 35 tasks over 24 apps supplies no quantitative metrics, baselines, task selection criteria, statistical tests, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed and the central empirical contribution remains unevaluable.
- [System Architecture] System Architecture section: The multi-LLM pipeline is described as decomposing queries, navigating pages, and synthesizing reports, yet no mechanisms are specified for detecting or recovering from navigation dead-ends, OCR errors on dynamic screens, or privacy-screen misclassifications. Given that mobile UIs vary rapidly across apps, this omission is load-bearing for the claims of autonomous coverage improvements and reduced workload.
minor comments (2)
- [System Design] The progress dashboard description would benefit from additional detail on how exploration maps are rendered in real time and how user interventions are logged for later analysis.
- [Discussion] Consider adding a limitations subsection that explicitly discusses failure rates observed during the 35-task evaluation and the frequency of required user takeovers.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our empirical results and system robustness. We address each major point below and indicate planned revisions to the manuscript.
- Referee: [Abstract and Evaluation] Abstract and Evaluation section: The claim that 'experiments and user studies demonstrate improvements in coverage, transparency, and reduced workload' across 35 tasks over 24 apps supplies no quantitative metrics, baselines, task selection criteria, statistical tests, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed and the central empirical contribution remains unevaluable.
  Authors: We agree that the current abstract and high-level summary in the Evaluation section lack the requested quantitative detail. In the revised manuscript we will expand the Evaluation section to report specific metrics including coverage rates (percentage of information items successfully retrieved), success rates across the 35 tasks, workload reduction via NASA-TLX scores, baseline comparisons against manual search and a prior mobile agent, explicit task selection criteria (tasks sampled from productivity, research, and shopping scenarios across the 24 apps), paired statistical tests with p-values, and a categorized error analysis of failure cases. These additions will make the magnitude and reliability of the gains directly assessable.
  Revision: yes
- Referee: [System Architecture] System Architecture section: The multi-LLM pipeline is described as decomposing queries, navigating pages, and synthesizing reports, yet no mechanisms are specified for detecting or recovering from navigation dead-ends, OCR errors on dynamic screens, or privacy-screen misclassifications. Given that mobile UIs vary rapidly across apps, this omission is load-bearing for the claims of autonomous coverage improvements and reduced workload.
  Authors: We acknowledge that the nominal pipeline description omits explicit error-handling mechanisms. In the revision we will add a dedicated subsection detailing: (1) navigation dead-end detection via LLM-based page-state verification followed by backtracking or alternative path selection; (2) OCR error mitigation through multi-frame screenshot capture and cross-verification with the LLM; and (3) privacy-screen classification using an ensemble of vision-language models with confidence thresholding and fallback to user prompts. These mechanisms directly support the reliability claims for autonomous coverage and workload reduction in variable mobile UIs.
  Revision: yes
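Two of the mechanisms promised in the rebuttal, backtracking out of dead-ends and confidence-thresholded privacy classification with a user fallback, can be illustrated with a short sketch. Everything here is hypothetical: the function names, the screen graph as a plain dictionary, the averaging of model scores, and the 0.2 and 0.8 thresholds are assumptions for illustration, not details from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence


def find_path(graph: Dict[str, List[str]],
              start: str,
              is_target: Callable[[str], bool]) -> Optional[List[str]]:
    """Depth-first search that backtracks out of dead-end screens."""
    path = [start]
    seen = {start}

    def dfs(node: str) -> bool:
        if is_target(node):
            return True
        for nxt in graph.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            path.append(nxt)
            if dfs(nxt):
                return True
            path.pop()  # dead end: backtrack and try the next option
        return False

    return path if dfs(start) else None


def classify_screen(scores: Sequence[float],
                    low: float = 0.2,
                    high: float = 0.8) -> str:
    """Combine per-model sensitivity scores into an action.

    `scores` are each model's estimated probability that the screen is
    privacy-sensitive. Uncertain averages fall back to asking the user.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    avg = mean(scores)
    if avg >= high:
        return "pause"
    if avg <= low:
        return "proceed"
    return "ask_user"
```

The middle band between the two thresholds is what implements the fallback: when the ensemble is neither clearly confident nor clearly unconcerned, the decision goes to the user rather than to the agent.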
Circularity Check
No circularity: engineering system description with no derivations or fitted predictions
The paper presents DroidRetriever as a multi-LLM mobile automation system with a progress dashboard, evaluated on 35 tasks across 24 apps for coverage, transparency, and workload. No equations, parameters, or predictions appear in the abstract or described architecture. The contribution is an implemented system plus user studies rather than a derivation chain; nothing reduces by construction to prior fitted values or self-citations. This is the expected non-finding for a systems paper whose claims rest on empirical demonstration rather than closed-form reasoning.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multi-LLM agents can accurately decompose information-seeking tasks and navigate mobile app interfaces across diverse apps.
invented entities (1)
- Progress dashboard combining sub-task progress and real-time exploration maps (no independent evidence)