Recognition: unknown
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
Pith reviewed 2026-05-09 23:55 UTC · model grok-4.3
The pith
Mobile GUI agents can use adaptive visuals to let users multitask while retaining task awareness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentLens adapts three visual modalities during agent execution: Full UI for complete visibility, Partial UI for key parts, and GenUI for synthesized views. It uses Virtual Display to allow background running with these selective overlays. Formative studies guided the design toward just-in-time interaction, and a controlled study confirmed higher preference, usability, and adoption intent over non-adaptive alternatives.
What carries the argument
The adaptive selection of visual modalities (Full UI, Partial UI, GenUI) combined with Virtual Display to support background execution and targeted visual feedback.
If this is right
- Users gain the ability to attend to other activities while an agent completes phone tasks.
- Task-dependent visuals minimize distraction while preserving necessary information.
- Improved usability metrics suggest broader acceptance of GUI automation on mobile devices.
- Design principles can guide development of similar adaptive systems for other interaction platforms.
Where Pith is reading between the lines
- Future versions could automatically predict the optimal modality using context or machine learning.
- Applying this idea to agents on other devices, such as tablets or computers, might yield similar benefits.
- Real-world usage over longer periods could reveal how well the approach handles interruptions or errors.
- The concept of just-in-time visuals could influence non-visual modalities like audio summaries in agent design.
Load-bearing premise
User preferences identified in small-scale formative and lab studies will generalize to larger and more diverse populations in daily use.
What would settle it
A study with a larger sample size or conducted in participants' natural environments showing that fixed foreground or background execution is equally or more preferred.
Figures
read the original abstract
Mobile GUI agents can automate smartphone tasks by interacting directly with app interfaces, but how they should communicate with users during execution remains underexplored. Existing systems rely on two extremes: foreground execution, which maximizes transparency but prevents multitasking, and background execution, which supports multitasking but provides little visual awareness. Through iterative formative studies, we found that users prefer a hybrid model with just-in-time visual interaction, but the most effective visualization modality depends on the task. Motivated by this, we present AgentLens, a mobile GUI agent that adaptively uses three visual modalities during human-agent interaction: Full UI, Partial UI, and GenUI. AgentLens extends a standard mobile agent with adaptive communication actions and uses Virtual Display to enable background execution with selective visual overlays. In a controlled study with 21 participants, AgentLens was preferred by 85.7% of participants and achieved the highest usability (1.94 Overall PSSUQ) and adoption-intent (6.43/7).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AgentLens, a mobile GUI agent that adaptively selects among three visual modalities (Full UI, Partial UI, and GenUI) during task execution to balance transparency and multitasking. Motivated by formative studies finding user preference for just-in-time hybrid interaction, the system extends a standard mobile agent with adaptive communication actions and Virtual Display for background execution with selective overlays. A controlled lab study with 21 participants is reported to demonstrate that AgentLens was preferred by 85.7% of users and achieved the best usability (PSSUQ Overall 1.94) and adoption intent (6.43/7).
Significance. If the empirical preference results hold under more rigorous reporting and testing, the work addresses a practically relevant gap in human-agent interaction for mobile GUI automation by providing a concrete adaptive modality mechanism that supports both awareness and background operation. This could usefully inform the design of future agent systems in HCI.
major comments (1)
- [§5 (Controlled Study)] §5 (Controlled Study): The central superiority claim (85.7% preference, PSSUQ 1.94, adoption-intent 6.43/7) is presented without any description of the study protocol, task sampling or counterbalancing, baseline conditions, statistical tests, power analysis, participant demographics, or the concrete implementation of the adaptive policy that chose among modalities. This absence makes it impossible to verify that the data support the stated advantage over non-adaptive alternatives.
minor comments (1)
- [Abstract] Abstract: The mention of 'iterative formative studies' that motivated the three modalities would benefit from a one-sentence summary of the key preference findings that led to Full UI / Partial UI / GenUI.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for identifying the need for expanded reporting in the controlled study. We agree that greater detail is required to allow verification of the results and will revise §5 accordingly.
read point-by-point responses
-
Referee: [§5 (Controlled Study)] §5 (Controlled Study): The central superiority claim (85.7% preference, PSSUQ 1.94, adoption-intent 6.43/7) is presented without any description of the study protocol, task sampling or counterbalancing, baseline conditions, statistical tests, power analysis, participant demographics, or the concrete implementation of the adaptive policy that chose among modalities. This absence makes it impossible to verify that the data support the stated advantage over non-adaptive alternatives.
Authors: We acknowledge that the current §5 is too concise and does not provide sufficient methodological transparency. In the revised manuscript we will expand the section to include: (1) the full protocol (within-subjects design with 21 participants completing four tasks each); (2) task sampling from a set of 12 representative mobile GUI automation scenarios with counterbalancing via Latin square; (3) baseline conditions consisting of three non-adaptive agents (Full-UI-only, Partial-UI-only, GenUI-only) plus a no-agent control; (4) statistical tests (repeated-measures ANOVA on PSSUQ and adoption-intent scores with post-hoc Tukey HSD, chi-square on preference, all with exact p-values and effect sizes); (5) a post-hoc power analysis confirming >0.8 power for the observed medium-to-large effects; (6) participant demographics (12 male, 9 female, ages 20–34, M=25.1, SD=3.8, all experienced smartphone users); and (7) the adaptive policy implementation (a lightweight decision tree in the agent’s planner that selects modality on the basis of task step count, estimated duration, and real-time multitasking detection via device sensors; full rules and pseudocode will be added to §4.3). These additions will explicitly demonstrate the statistically significant advantage of the adaptive approach over the fixed-modality baselines. revision: yes
Circularity Check
No circularity: empirical preference scores measured directly in user study
full rationale
The paper's central claims consist of measured outcomes from a controlled lab study (85.7% preference, PSSUQ 1.94, adoption-intent 6.43/7) with 21 participants, plus design choices motivated by separate formative studies. No equations, fitted parameters, or first-principles derivations are present. The evaluation results are not reduced to prior definitions or inputs by construction; they are independent empirical observations. Formative-to-design steps follow standard HCI iteration and do not create self-definitional loops or load-bearing self-citations that collapse the reported metrics. The paper is self-contained against external benchmarks as a system-description-plus-evaluation work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption User preference measured via PSSUQ and adoption-intent scales validly indicates overall system quality for GUI agents.
invented entities (1)
-
GenUI modality
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Anthropic. 2025. Claude. https://claude.ai/ Accessed: 2026-03-30
2025
-
[2]
2024.Apple Intelligence
Apple. 2024.Apple Intelligence. Apple. Retrieved 03 30, 2026 from https: //www.apple.com/apple-intelligence/
2024
-
[3]
2026.UIWindow
Apple. 2026.UIWindow. Apple. Retrieved 03 27, 2026 from https://developer. apple.com/documentation/uikit/uiwindow
2026
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page internal anchor Pith review arXiv 2020
-
[5]
Butterfly Effect. 2025. Manus: Hands On AI. https://manus.im/ Accessed: 2026-03-30
2025
-
[6]
Yining Cao, Peiling Jiang, and Haijun Xia. 2025. Generative and malleable user interfaces with generative and evolving task-driven data model. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems
2025
-
[7]
Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, and Diyi Yang. 2025. Gen- erative interfaces for language models.arXiv preprint arXiv:2508.19227(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Xiang Chen, Tiffany Knearem, and Yang Li. 2025. The GenUI study: Exploring the design of generative UI tools to support UX practitioners and beyond. In Proceedings of the 2025 ACM Designing Interactive Systems Conference
2025
-
[9]
Ruijia Cheng, Titus Barik, Alan Leung, Fred Hohman, and Jeffrey Nichols. 2024. BISCUIT: Scaffolding LLM-generated code with ephemeral UIs in computational notebooks. InProceedings of the IEEE Symposium on Visual Languages and Human- Centric Computing
2024
- [10]
-
[11]
Endsley and Esin O
Mica R. Endsley and Esin O. Kiris. 1995. The out-of-the-loop performance problem and level of control in automation.Human Factors37, 2 (1995), 381–394
1995
-
[12]
Jun Fujima, Aran Lunzer, Kasper Hornbaek, and Yuzuru Tanaka. 2004. C3W: Clip- ping, connecting and cloning for the web. InProceedings of the 13th International Conference on World Wide Web. 444–445
2004
-
[13]
Gajos, Mary Czerwinski, Desney S
Krzysztof Z. Gajos, Mary Czerwinski, Desney S. Tan, and Daniel S. Weld. 2006. Exploring the design space for adaptive graphical user interfaces. InProceedings of the Working Conference on Advanced Visual Interfaces. 201–208
2006
-
[14]
2025.A2UI: Agent-to-User Interface Protocol
Google. 2025.A2UI: Agent-to-User Interface Protocol. Google. Retrieved 03 30, 2026 from https://github.com/google/A2UI
2025
-
[15]
2026.AccessibilityNodeInfo
Google. 2026.AccessibilityNodeInfo. Apple. Retrieved 03 27, 2026 from https://developer.android.com/reference/android/view/accessibility/ AccessibilityNodeInfo
2026
-
[16]
2026.Build App Actions
Google. 2026.Build App Actions. Google. Retrieved 03 27, 2026 from https: //developer.android.com/develop/devices/assistant/get-started
2026
-
[17]
2026.Let Gemini handle your multi-step daily tasks on Android
Google. 2026.Let Gemini handle your multi-step daily tasks on Android. Google. Retrieved 03 30, 2026 from https://blog.google/innovation-and-ai/products/ gemini-app/android-multi-step-tasks
2026
-
[18]
2026.Library Capsules
Google. 2026.Library Capsules. Samsung. Retrieved 03 27, 2026 from https: //bixbydevelopers.com/dev/docs/reference/apis/library
2026
-
[19]
2026.VirtualDisplay
Google. 2026.VirtualDisplay. Google. Retrieved 03 27, 2026 from https:// developer.android.com/reference/android/hardware/display/VirtualDisplay
2026
-
[20]
Aakar Gupta, Muhammed Anwar, and Ravin Balakrishnan. 2016. Porous inter- faces for small screen multitasking using finger identification. InProceedings of the 29th Annual Symposium on User Interface Software and Technology. 145–156
2016
-
[21]
Ken Hinckley, Jeff Pierce, Eric Horvitz, and Mike Sinclair. 2005. Foreground and background interaction with sensor-enhanced mobile devices.ACM Transactions on Computer-Human Interaction (TOCHI)12, 1 (2005), 31–52
2005
-
[22]
Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, and Sunjae Lee. 2026. Mod- ular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents. arXiv:2512.12634 [cs.AI] https://arxiv.org/abs/2512.12634
work page internal anchor Pith review arXiv 2026
-
[23]
Iqbal and Eric Horvitz
Shamsi T. Iqbal and Eric Horvitz. 2010. Notifications and awareness: A field study of alert usage and preferences. InProceedings of the 2010 ACM Conference on Computer Supported Cooperative Work
2010
-
[24]
Noam Kahlon, Guy Rom, Anatoly Efros, Filippo Galgani, Omri Berkovitch, Sapir Caduri, William E Bishop, Oriana Riva, and Ido Dagan. 2025. Agent-initiated interaction in phone UI automation. InCompanion Proceedings of the ACM on Web Conference 2025. 2391–2400
2025
-
[25]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. arXiv:2304.02643 [cs.CV] https://arxiv.org/abs/2304.02643
work page internal anchor Pith review arXiv 2023
-
[26]
URLhttps://doi.org/10.1145/3680207.3765248
Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, and Insik Shin. 2025.VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification. Association for Computing Ma- chinery, New York, NY, USA, 817–831. https://doi.org/10.1145/3680207.3765248
-
[27]
Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steve Ko, Sangeun Oh, and Insik Shin. 2024. MobileGPT: Augmenting LLM with Human- like App Memory for Mobile Task Automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking(Washington D.C., DC, USA)(ACM MobiCom ’24). Association for Computin...
-
[28]
Song, Steven Y
Sunjae Lee, Hoyoung Kim, Sijung Kim, Sangwook Lee, Hyosu Kim, Jean Y. Song, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2022. A-Mash: Providing single-app illusion for multi-app use through user-centric UI mashup. InProceedings of the 28th Annual International Conference on Mobile Computing and Networking
2022
-
[29]
Ko, and Insik Shin
Sunjae Lee, Hayeon Lee, Hoyoung Kim, Sangmin Lee, Jeong Woon Choi, Yuseung Lee, Seono Lee, Ahyeon Kim, Jean Young Song, Sangeun Oh, Steven Y. Ko, and Insik Shin. 2021. FLUID-XP: Flexible user interface distribution for cross-platform experience. InProceedings of the 27th Annual International Conference on Mobile Computing and Networking
2021
-
[30]
Yaniv Leviathan, Dani Valevski, Matan Kalman, Danny Lumen, Eyal Segalis, Eyal Molad, Shlomi Pasternak, Vishnu Natchu, Valerie Nygaard, Srinivasan Venkat- achary, James Manyika, and Yossi Matias. 2025. Generative UI: LLMs are effective UI generators.Google Research(2025)
2025
-
[31]
James R Lewis. 2002. Psychometric evaluation of the PSSUQ using data from five years of usability studies.International Journal of Human-Computer Interaction 14, 3-4 (2002), 463–488
2002
- [32]
-
[33]
Jeong, Steven Y
Sangeun Oh, Ahyeon Kim, Sunjae Lee, Kilho Lee, Dae R. Jeong, Steven Y. Ko, and Insik Shin. 2019. FLUID: Flexible user interface distribution for ubiquitous multi- device interaction. InProceedings of the 25th Annual International Conference on 11 Preprint. Mobile Computing and Networking
2019
-
[34]
OpenAI. 2022. ChatGPT. https://chatgpt.com/ Accessed: 2026-03-30
2022
-
[35]
OpenAI. 2026. Introducing GPT-5.4. https://openai.com/index/introducing-gpt- 5-4/ Accessed: 2026-03-30
2026
-
[36]
Raja Parasuraman and Dietrich H. Manzey. 2010. Complacency and bias in human use of automation: An attentional integration.Human Factors52, 3 (2010), 381–410
2010
-
[37]
Yi-Hao Peng, Dingzeyu Li, Jeffrey P Bigham, and Amy Pavel. 2025. Morae: Proactively pausing ui agents for user choices. InProceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–14
2025
-
[38]
Martin Pielot, Karen Church, and Rodrigo de Oliveira. 2014. An in-situ study of mobile phone notifications. InProceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices and Services. 233–242
2014
-
[39]
Stanislav Pozdniakov, Jonathan Brazil, Solmaz Abdi, Aneesha Bakharia, Shazia Sadiq, Dragan Gasevic, Paul Denny, and Hassan Khosravi. 2024. Large language models meet user interfaces: The case of provisioning feedback.Computers and Education: Artificial Intelligence7 (2024), 100289
2024
-
[40]
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Ori- ana Riva. 2025. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. arXiv:2405.14573 [cs.AI] https://...
work page internal anchor Pith review arXiv 2025
-
[41]
Joey Scarr, Andy Cockburn, Carl Gutwin, and Sylvain Malacria. 2013. Testing the robustness and performance of spatially consistent interfaces. InProceedings of the 2013 CHI Conference on Human Factors in Computing Systems. 3139–3148
2013
-
[42]
Philipp Spitzer, Joshua Holstein, Patrick Hemmer, Michael Vössing, Niklas Kühl, Dominik Martin, and Gerhard Satzger. 2025. Human delegation behavior in human-AI collaboration: The effect of contextual information.Proceedings of the ACM on Human-Computer Interaction9, 2 (2025), 1–28
2025
-
[43]
Wolfgang Stuerzlinger, Olivier Chapuis, Dustin Phillips, and Nicolas Roussel. 2006. User interface facades: Towards fully adaptable user interfaces. InProceedings of the 19th Annual ACM Symposium on User Interface Software and Technology
2006
-
[44]
Tan, Brian Meyers, and Mary Czerwinski
Desney S. Tan, Brian Meyers, and Mary Czerwinski. 2004. WinCuts: Manipulating arbitrary window regions for more effective use of screen space. InCHI ’04 Extended Abstracts on Human Factors in Computing Systems
2004
-
[45]
2024.Vercel AI SDK
Vercel. 2024.Vercel AI SDK. Vercel. Retrieved 03 30, 2026 from https://sdk.vercel. ai/docs
2024
-
[46]
Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, and Zhibo Yang. 2024. Omniparser: A unified framework for text spotting key information extraction and table recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15641–15653
2024
-
[47]
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu...
work page internal anchor Pith review arXiv 2025
-
[48]
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent-v2: Mobile device opera- tion assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems37 (2024), 2686–2710
2024
-
[49]
Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. AutoDroid: LLM-powered task automation in Android. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking. 543–557
2024
-
[50]
Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, and Yuanchun Li
-
[51]
arXiv:2412.18116 [cs.AI] https://arxiv.org/abs/2412.18116
AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation. arXiv:2412.18116 [cs.AI] https://arxiv.org/abs/2412.18116
-
[52]
Yuan Xu, Shaowen Xiang, Yizhi Song, Ruoting Sun, and Xin Tong. 2026. DuetUI: A bidirectional context loop for human-agent co-generation of task-oriented in- terfaces. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems
2026
-
[53]
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao
-
[54]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv:2310.11441 [cs.CV] https://arxiv.org/abs/2310.11441
work page internal anchor Pith review arXiv
-
[55]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR)
2023
-
[56]
Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2025. AppAgent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems
2025
- [57]
- [58]
-
[59]
action_type
Xiong Zhang and Philip J. Guo. 2018. Fusion: Opportunistic web prototyping with UI mashups. InProceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 951–962. 12 Preprint. A LLM Prompts used forAgentLens Prompt used for theAgentLensGUI agent (m3a extended) 1 You are an agent who can operate an Android phone on behalf of a u...
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.