Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible

Lepeng Zhao; Shuo Li; Zhenhua Zou; Zhuotao Liu

arxiv: 2602.10139 · v3 · submitted 2026-02-08 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-16 05:56 UTC · model grok-4.3

keywords privacy protection · GUI agents · anonymization · PII detection · mobile security · multimodal models · data obfuscation

The pith

Anonymization replaces sensitive mobile UI content with semantic placeholders so GUI agents can complete tasks without seeing private data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mobile GUI agents process full screen contents and therefore expose personal details such as phone numbers, addresses, and messages to cloud models. The paper introduces a framework that detects personally identifiable information on the device, replaces it with deterministic placeholders that keep type and category information, and routes all agent actions through a secure proxy. This setup ensures the cloud-based model never receives raw sensitive values while still receiving enough structure to reason about the interface. Experiments on AndroidLab and PrivScreen benchmarks report large drops in privacy leakage together with only modest losses in task success rate. The method is presented as achieving the strongest privacy-utility balance among current defenses.

Core claim

The framework enforces available-but-invisible access: a PII detector identifies sensitive UI elements, a UI transformer substitutes them with placeholders such as PHONE_NUMBER#a1b2c, and a layered architecture of detector, transformer, secure interaction proxy, and privacy gatekeeper keeps raw data local while allowing the agent to operate over the anonymized view across instructions, XML hierarchies, and screenshots.
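The detector and transformer stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regex stands in for the PII-aware recognition model, the truncated SHA-256 suffix is an assumed way to obtain a deterministic identifier, and the names `anonymize_text`, `anonymize_xml`, and `vault` are hypothetical. Screenshots are omitted because they would need an OCR and redraw step.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

# Stand-in detector for a single category (assumption: the paper uses a
# PII-aware recognition model, not a regex).
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def anonymize_text(text: str, vault: dict) -> str:
    """Replace detected phone numbers with CATEGORY#suffix placeholders."""

    def swap(match: re.Match) -> str:
        raw = match.group(0)
        # Assumed suffix scheme: first 5 hex chars of a SHA-256 digest.
        token = "PHONE_NUMBER#" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:5]
        vault[token] = raw  # the placeholder-to-raw mapping never leaves the device
        return token

    return PHONE.sub(swap, text)


def anonymize_xml(xml_str: str, vault: dict) -> str:
    """Apply the same substitution to text-bearing attributes of a UI hierarchy."""
    root = ET.fromstring(xml_str)
    for node in root.iter():
        for attr in ("text", "content-desc"):
            if attr in node.attrib:
                node.set(attr, anonymize_text(node.attrib[attr], vault))
    return ET.tostring(root, encoding="unicode")


vault = {}
instruction = anonymize_text("Call 415-555-0132 and say hi", vault)
screen = anonymize_xml('<hierarchy><node text="Mom 415-555-0132" /></hierarchy>', vault)
```

Because the placeholder depends only on the category and the raw value, the instruction and the XML hierarchy end up with the same token, which is what lets the agent link "the number in the instruction" to "the number on screen" without seeing it.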

What carries the argument

Deterministic type-preserving placeholders that replace detected PII while preserving semantic category information for multimodal agent reasoning.

If this is right

Privacy leakage drops substantially across several multimodal models.
Task success rate declines only modestly on the evaluated benchmarks.
The same anonymization applies consistently to user instructions, XML layouts, and screenshots.
Narrowly scoped local computations can still be invoked when raw values are required.
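The last two points imply a local component that translates agent actions expressed over placeholders back into raw values, and that answers only a fixed set of questions about those values. A hedged sketch of that idea follows; the action format, the `ALLOWED_OPS` table, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import re

# Matches tokens shaped like PHONE_NUMBER#a1b2c.
PLACEHOLDER = re.compile(r"[A-Z_]+#[0-9a-f]{5}")


def resolve_action(action: dict, vault: dict) -> dict:
    """Swap placeholders in an agent action for raw values just before local execution."""
    resolved = dict(action)
    if "text" in resolved:
        resolved["text"] = PLACEHOLDER.sub(
            lambda m: vault.get(m.group(0), m.group(0)), resolved["text"]
        )
    return resolved


# Hypothetical whitelist of narrowly scoped computations over raw values.
ALLOWED_OPS = {
    "length": lambda value: len(value),
    "is_numeric": lambda value: value.replace("-", "").isdigit(),
}


def local_compute(op: str, token: str, vault: dict):
    """Run a whitelisted operation on-device and return only the derived result."""
    if op not in ALLOWED_OPS:
        raise PermissionError(f"operation {op!r} is not permitted")
    return ALLOWED_OPS[op](vault[token])


vault = {"PHONE_NUMBER#a1b2c": "415-555-0132"}
agent_action = {"action": "type", "text": "PHONE_NUMBER#a1b2c"}
executed = resolve_action(agent_action, vault)
```

The cloud model only ever sees `agent_action` and the return value of `local_compute`; the resolved action is consumed on the device.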

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same placeholder technique could be applied to web or desktop GUI agents that face comparable screen-exposure risks.
Users could be given controls to tune detection sensitivity for different categories of information.
On-device detection models would further reduce the amount of raw screen data that ever leaves the device.

Load-bearing premise

The PII recognition model catches every sensitive element and the placeholders supply enough semantic detail for agents to reason correctly over the anonymized interface.

What would settle it

An experiment in which the agent either fails to complete tasks at the reported success rate or still leaks identifiable values through the anonymized screenshots or XML on the AndroidLab and PrivScreen benchmarks.
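One simple way to run the leakage half of such a check is to scan everything sent to the cloud model for known raw values. The sketch below assumes a list of ground-truth sensitive strings is available per task and uses case-insensitive substring matching; the benchmarks' own leakage metrics may be defined differently, and screenshots would first need OCR.

```python
def leakage_rate(sensitive_values: list, payloads: list) -> float:
    """Fraction of known sensitive values that appear verbatim in outbound payloads."""
    if not sensitive_values:
        return 0.0
    blob = "\n".join(payloads).lower()
    leaked = [value for value in sensitive_values if value.lower() in blob]
    return len(leaked) / len(sensitive_values)


rate = leakage_rate(
    ["415-555-0132", "alice@example.com"],
    ["Call PHONE_NUMBER#a1b2c", "Send the file to alice@example.com"],
)
```

Here the phone number was replaced but the email address slipped through, so half of the sensitive values leaked.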

Figures

Figures reproduced from arXiv: 2602.10139 by Lepeng Zhao, Shuo Li, Zhenhua Zou, Zhuotao Liu.

**Figure 2.** Example of category-preserving anonymization of user instructions.

**Figure 3.** Comparison of screenshots before and after anonymization. The left image shows the original screen before …

**Figure 4.** Example of Type proxy resolution. The text in black regions highlights enlarged excerpts of the magenta regions to illustrate the corresponding content.

Original abstract

Mobile Graphical User Interface (GUI) agents have demonstrated strong capabilities in automating complex smartphone tasks by leveraging multimodal large language models (MLLMs) and system-level control interfaces. However, this paradigm introduces significant privacy risks, as agents typically capture and process entire screen contents, thereby exposing sensitive personal data such as phone numbers, addresses, messages, and financial information. Existing defenses either reduce UI exposure, obfuscate only task-irrelevant content, or rely on user authorization, but none can protect task-critical sensitive information while preserving seamless agent usability. We propose an anonymization-based privacy protection framework that enforces the principle of available-but-invisible access to sensitive data: sensitive information remains usable for task execution but is never directly visible to the cloud-based agent. Our system detects sensitive UI content using a PII-aware recognition model and replaces it with deterministic, type-preserving placeholders (e.g., PHONE_NUMBER#a1b2c) that retain semantic categories while removing identifying details. A layered architecture comprising a PII Detector, UI Transformer, Secure Interaction Proxy, and Privacy Gatekeeper ensures consistent anonymization across user instructions, XML hierarchies, and screenshots, mediates all agent actions over anonymized interfaces, and supports narrowly scoped local computations when reasoning over raw values is necessary. Extensive experiments on the AndroidLab and PrivScreen benchmarks show that our framework substantially reduces privacy leakage across multiple models while incurring only modest utility degradation, achieving the best observed privacy-utility trade-off among existing methods. Code available at: https://github.com/one-step-beh1nd/gui_privacy_protection

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical layered system for swapping PII in GUI agent screens with type-preserving placeholders, but the central privacy-utility claim rests on unquantified detector recall and placeholder effects.

The core contribution is a framework that spots sensitive UI elements on phone screens and swaps them for deterministic placeholders like PHONE_NUMBER#a1b2c. These keep the category and structure so the agent can still reason and act, while the real data stays local or proxied. The architecture runs a PII detector, transforms the UI, routes actions through a secure proxy, and adds a gatekeeper for any needed local checks. This setup aims to make sensitive information available for tasks but invisible to the cloud model. The experiments on AndroidLab and PrivScreen are said to cut leakage across models with only modest drops in task success, and the code is released. That combination of detection, placeholder scheme, and mediation for GUI agents does not appear directly in the cited prior work.

The work does a clean job of framing why prior defenses fall short and of showing a concrete end-to-end flow that tries to preserve usability. Releasing the implementation is helpful for anyone who wants to try it.

The soft spots are in the evaluation. The claim of the best observed trade-off depends on the detector catching nearly everything and on the placeholders not confusing the downstream MLLMs. The abstract and description give no recall, precision, or false-negative numbers for the PII model on those benchmarks, and no ablations isolate how placeholder choices affect reasoning accuracy. Without those, the reported gains could be overstated if misses or semantic loss occur on even a moderate share of screens. The results summary also omits exact metrics, baselines, and variance, so it is hard to judge the margin over existing methods.

This is for researchers building or securing mobile GUI agents who need applied privacy tools. A reader focused on practical multimodal agent defenses would get value from the architecture and the released code.

It deserves peer review because the problem is real and the approach is implementable, even if referees will want tighter numbers on detection and placeholder impact before the trade-off claim can be taken as settled. I would send it out rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes an anonymization framework for mobile GUI agents that uses a PII-aware recognition model to detect sensitive UI content and replaces it with deterministic, type-preserving placeholders (e.g., PHONE_NUMBER#a1b2c). A layered architecture (PII Detector, UI Transformer, Secure Interaction Proxy, Privacy Gatekeeper) ensures consistent anonymization across instructions, XML, and screenshots while mediating agent actions. Experiments on AndroidLab and PrivScreen benchmarks are reported to show substantial privacy leakage reduction across models with only modest utility degradation, achieving the best observed privacy-utility trade-off among existing methods.

Significance. If the results hold after addressing the quantification gaps, the work is significant for providing a practical, available-but-invisible privacy mechanism for MLLM-based GUI agents that avoids both full UI exposure and task-irrelevant obfuscation. The open-sourced code is a positive contribution that supports reproducibility and extension.

major comments (2)

[Experiments] Experiments section: No precision, recall, or F1 scores are reported for the PII-aware recognition model on AndroidLab or PrivScreen. This is load-bearing for the central privacy-reduction claim; without near-zero false-negative rates, measured leakage reductions would be inflated.
[Experiments] Experiments section: No ablations isolate the effect of the deterministic placeholder scheme on downstream MLLM task accuracy. This undermines the utility-degradation and trade-off claims, as it is unclear whether observed performance stems from anonymization or other factors.
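The detector metrics requested in the first comment can be computed from element-level annotations. The sketch below assumes each detection is represented as an `(element_id, category)` pair and that gold labels exist for the benchmark screens; both the representation and the example labels are hypothetical.

```python
def detection_scores(predicted: set, gold: set) -> dict:
    """Precision, recall, and F1 for PII detections given as (element_id, category) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Missed elements are the ones that would reach the cloud model unmasked.
        "false_negatives": len(gold - predicted),
    }


gold = {("n1", "PHONE_NUMBER"), ("n2", "EMAIL"), ("n3", "ADDRESS"), ("n4", "NAME")}
predicted = {("n1", "PHONE_NUMBER"), ("n2", "EMAIL"), ("n5", "NAME")}
scores = detection_scores(predicted, gold)
```

In this toy case two of four gold elements are missed, so recall is 0.5 even though two of the three predictions are correct. Recall is the figure that bounds how much raw data can still leak.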

minor comments (1)

[Abstract] Abstract: The claim of 'best observed privacy-utility trade-off' is stated without naming the specific baselines or reporting the exact quantitative deltas used for comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to improve the manuscript. We address each major comment below and will revise the experiments section accordingly to strengthen the privacy and utility claims.

Referee: [Experiments] Experiments section: No precision, recall, or F1 scores are reported for the PII-aware recognition model on AndroidLab or PrivScreen. This is load-bearing for the central privacy-reduction claim; without near-zero false-negative rates, measured leakage reductions would be inflated.

Authors: We agree that explicit performance metrics for the PII-aware recognition model are necessary to fully support the privacy-reduction results. Privacy leakage was measured directly via the presence of sensitive content in agent outputs and interaction traces rather than assuming perfect detection. In the revised version we will add precision, recall, and F1 scores for the detector evaluated on both AndroidLab and PrivScreen, along with a brief discussion of false-negative impact on the observed leakage figures. revision: yes

Referee: [Experiments] Experiments section: No ablations isolate the effect of the deterministic placeholder scheme on downstream MLLM task accuracy. This undermines the utility-degradation and trade-off claims, as it is unclear whether observed performance stems from anonymization or other factors.

Authors: We acknowledge that dedicated ablations would better isolate the contribution of the deterministic placeholder scheme. The current utility results compare the full anonymization pipeline against non-anonymized baselines, but do not vary the placeholder mechanism itself. We will add ablation experiments in the revision that replace deterministic placeholders with random strings or task-irrelevant tokens while keeping the rest of the pipeline fixed, thereby clarifying the source of any accuracy changes. revision: yes
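The proposed ablation amounts to swapping the placeholder generator while holding the rest of the pipeline fixed. A minimal sketch of the three conditions follows. The mode names, the truncated SHA-256 suffix, and the `[REDACTED]` token are assumptions chosen for illustration rather than details from the paper or the rebuttal.

```python
import hashlib
import secrets


def placeholder_variant(mode: str, category: str, raw: str) -> str:
    """Return the replacement string for one detected value under an ablation condition."""
    if mode == "deterministic":
        # Type-preserving and stable: the same value always maps to the same token.
        suffix = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:5]
        return f"{category}#{suffix}"
    if mode == "random":
        # No category information and no stability across occurrences.
        return secrets.token_hex(8)
    if mode == "generic":
        # One task-irrelevant token shared by every detected value.
        return "[REDACTED]"
    raise ValueError(f"unknown ablation mode: {mode!r}")


first = placeholder_variant("deterministic", "PHONE_NUMBER", "415-555-0132")
second = placeholder_variant("deterministic", "PHONE_NUMBER", "415-555-0132")
noise = placeholder_variant("random", "PHONE_NUMBER", "415-555-0132")
masked = placeholder_variant("generic", "PHONE_NUMBER", "415-555-0132")
```

Comparing task success across the three modes would separate the effect of hiding the value from the effect of keeping its category and identity visible to the model.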

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

The paper describes a systems framework for anonymizing sensitive UI content via PII detection and deterministic placeholders, with claims resting on experiments using the external AndroidLab and PrivScreen benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central privacy-utility results are presented as direct empirical outcomes rather than reductions to author-defined inputs by construction, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that anonymized placeholders retain sufficient semantics for agent reasoning and that the detector covers all relevant PII without false negatives.

axioms (1)

domain assumption: The PII-aware recognition model reliably identifies sensitive UI content across screenshots, XML, and instructions.
Invoked as the foundation for the UI Transformer and Privacy Gatekeeper components.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
