pith. machine review for the scientific record.

arXiv: 2601.18842 · v3 · submitted 2026-01-26 · 💻 cs.CR · cs.AI · cs.CV

Recognition: no theorem link

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:28 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CV
keywords GUI agents · privacy preservation · screenshot privacy · benchmark evaluation · risk assessment · trajectory workflows · Android environments · PC interfaces

The pith

GUIGuard-Bench shows current models detect private information in GUI screenshots but struggle with precise localization, category recognition, risk assessment, and judging task necessity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GUIGuard-Bench, a dataset of 241 real GUI-agent trajectories containing 4,080 screenshots from Android and PC settings. Each screenshot carries region-level labels for privacy bounding boxes, semantic categories, risk levels, and whether the information is required to finish the task. The benchmark supports three evaluations: how well models recognize privacy elements, whether planners stay consistent when screenshots are protected, and how different protection methods affect task success. Results indicate models commonly notice the presence of private data yet perform poorly on the finer judgments needed for safe operation. This matters because GUI agents that process raw screenshots risk exposing user identities, accounts, and behavior unless they can handle these distinctions reliably.

Core claim

GUIGuard-Bench supplies 241 trajectory-based GUI workflows with 4,080 screenshots annotated at the region level for privacy bounding boxes, categories, risk levels, and task necessity. It measures privacy recognition accuracy, offline planner fidelity after protection is applied to screenshots, and the utility cost of protection strategies. The evaluation finds that models can usually identify whether a screenshot contains private information, yet they falter on fine-grained localization, category recognition, risk assessment, and determining whether the private element is required for the task. Closed-source models maintain largely consistent planner semantics in Android environments once privacy protection is applied.
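The annotation structure this claim rests on (per-region bounding boxes, categories, risk levels, and task necessity) can be pictured as a minimal record type. The field names and values below are illustrative, not the benchmark's released schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PrivacyRegion:
    """One region-level label on a screenshot (illustrative fields)."""
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    category: str                    # semantic privacy category, e.g. "account"
    risk: str                        # "high" | "medium" | "low" | "none"
    task_necessary: bool             # is this element required for the task?

@dataclass
class Screenshot:
    image_path: str
    regions: List[PrivacyRegion] = field(default_factory=list)

@dataclass
class Trajectory:
    platform: str  # "android" | "pc"
    task: str
    screenshots: List[Screenshot] = field(default_factory=list)

# One step of a hypothetical Android trajectory:
traj = Trajectory(
    platform="android",
    task="transfer money to a saved contact",
    screenshots=[Screenshot(
        image_path="step_03.png",
        regions=[PrivacyRegion((120, 40, 480, 90), "account", "high", True)],
    )],
)
```

Each of the three evaluations then reads off a different field: recognition against the boxes and categories, risk assessment against the risk label, and necessity judgment against the task-necessity flag.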

What carries the argument

The GUIGuard-Bench dataset of trajectory screenshots carrying region-level annotations for privacy elements, risk, and task necessity.

Load-bearing premise

The human-provided region-level annotations for privacy bounding boxes, categories, risk levels, and task necessity accurately capture real-world GUI privacy risks across the collected trajectories.

What would settle it

A new collection of GUI trajectories whose privacy regions and necessity judgments are independently verified by multiple annotators or by observing actual data leaks in controlled agent runs, then re-testing the same models on localization and necessity accuracy.

Figures

Figures reproduced from arXiv: 2601.18842 by Jie Zhang, Jiyan He, Qiannan Zhu, Shuxin Zheng, Weiming Zhang, Wenbo Zhou, Yanxi Wang, Yu Shi, Zhiling Zhang.

Figure 1. Overview of the GUIGuard Framework and Benchmark. The top section shows GUIGuard's …
Figure 2. (a) OSWorld benchmark results (accuracy %), comparing representative closed-source …
Figure 3. The dataset structure is illustrated in the figure. It consists of 240 trajectories (4,080 …
Figure 4. Privacy recognition results on GUIGuard-Bench for both PC (blue) and Android (red) …
Figure 5. Fine-grained privacy label recognition on GUIGuard-Bench. For private elements that pass …
Figure 6. Privacy Protection for GUI Screenshots at Two Levels. Risk levels: red = high risk, yellow = medium risk, green = low risk, gray = no risk. Pixel-level masking (Mask): redact detected private regions using an opaque rectangular mask (blackout or background-color). Semantic-level replacement (Replace): anonymize sensitive regions via (i) LLM-based text replacement (extract and rewrite private text, then re-…
Figure 7. Task execution and fidelity evaluation framework of GUIGuard-Bench. The grounding …
Figure 8. (a) Task execution success rates on MobileWorld for the case agent with Gemini 3 as the …
Figure 9. The workflow alternates between the image generation model and the GUI agent: the model …
Figure 10. Task distribution of the GUI agent in (a) real PC and mobile environments and (b) …
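The pixel-level branch of the protection scheme in Figure 6 amounts to overwriting detected private regions before the screenshot reaches the model. A minimal sketch, with plain Python lists standing in for a real image buffer:

```python
def mask_region(image, bbox, fill=(0, 0, 0)):
    """Pixel-level masking: overwrite every pixel inside
    bbox = (x_min, y_min, x_max, y_max) with a flat color.
    `image` is a mutable H x W grid of (r, g, b) tuples."""
    x_min, y_min, x_max, y_max = bbox
    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            image[y][x] = fill
    return image

# 4x4 white "screenshot"; blackout a detected private region.
img = [[(255, 255, 255)] * 4 for _ in range(4)]
mask_region(img, (0, 0, 2, 2))
```

The semantic-level replacement path (rewriting private text via an LLM and re-rendering it) needs a generative model and is not sketched here.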
Original abstract

As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, locations, and behavioral traces. While existing benchmarks primarily focus on task completion, grounding, or defenses against third-party attacks, current visual privacy datasets remain largely restricted to static natural images, limiting their ability to capture the contextual dependence and task relevance of privacy risks in GUI task trajectories. To bridge this gap, we introduce GUIGuard-Bench, a first-step benchmark for studying privacy-preserving GUI agents in trajectory-based GUI workflows. GUIGuard-Bench contains 241 real GUI-agent trajectories with 4,080 screenshots across Android and PC environments. Each screenshot is annotated at the region level with privacy bounding boxes, semantic privacy categories, risk levels, and whether the private information is necessary for completing the task. Built on these annotations, GUIGuard-Bench supports three complementary evaluations: privacy recognition, offline planning fidelity under protected screenshots, and the utility impact of different protection strategies. Our results show that current models can often detect whether a screenshot contains private information, but they struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment. We also find that closed-source models, exemplified by Claude Sonnet 4.6, can maintain largely consistent planner semantics in Android environments after privacy protection is applied. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GUIGuard-Bench, a benchmark with 241 real GUI-agent trajectories and 4,080 screenshots from Android and PC environments. Each screenshot receives region-level annotations for privacy bounding boxes, semantic categories, risk levels, and task necessity. The benchmark supports three evaluations: privacy recognition by models, offline planning fidelity on protected screenshots, and utility impact of protection strategies. Key findings are that current models often detect private information presence but struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment, while closed-source models such as Claude Sonnet 4.6 maintain largely consistent planner semantics in Android environments after privacy protection is applied.

Significance. If the human annotations prove reliable, the benchmark addresses a clear gap between static-image privacy datasets and the contextual, trajectory-based privacy risks faced by GUI agents. The dual focus on recognition failures and downstream planning consistency provides actionable diagnostics for privacy-preserving agent design. The open release of trajectories and annotations could enable reproducible follow-up work on protection mechanisms.

major comments (3)
  1. [Dataset Construction and Annotation] Dataset annotation section: No inter-annotator agreement statistics, multiple-annotator protocol, or external validation against privacy experts are reported for the subjective labels (risk levels and task necessity). These labels directly underpin the headline claims about model struggles with risk assessment and task-necessity judgment; without agreement metrics, systematic annotator bias cannot be ruled out as an alternative explanation for the observed performance gaps.
  2. [Evaluation Results] Evaluation results: The claims that models 'can often detect' private information yet 'struggle' with localization, categories, risk assessment, and task-necessity judgment are presented without accompanying quantitative metrics (precision/recall, accuracy, or confusion matrices), error analysis, or the full evaluation protocol. This absence prevents assessment of effect sizes and reproducibility of the reported bottlenecks.
  3. [Offline Planning Fidelity] Planning fidelity evaluation: The consistency result for Claude Sonnet 4.6 after privacy protection is stated at a high level but lacks the concrete measurement protocol (e.g., semantic similarity metric, planner output comparison method, or control conditions) needed to substantiate that the planner semantics remain 'largely consistent.'
minor comments (2)
  1. [Abstract] Abstract: The summary of results would be strengthened by including at least one or two key quantitative figures (e.g., detection accuracy or consistency score) rather than purely qualitative statements.
  2. [Figures] Figure and table captions: Ensure all figures showing annotation examples or model outputs include explicit scale bars, color legends, and sample sizes so readers can interpret them without returning to the main text.
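The quantitative evidence major comment 2 asks for would come from standard metrics. A minimal sketch of presence-level precision/recall and IoU-matched localization (the 0.5 IoU threshold is a conventional choice, not the paper's stated protocol):

```python
def precision_recall(pred, gold):
    """Presence-level detection: pred/gold are per-screenshot booleans
    for 'contains private information'."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def localization_hits(pred_boxes, gold_boxes, thresh=0.5):
    """Count gold privacy regions matched by some prediction at IoU >= thresh."""
    return sum(any(iou(p, g) >= thresh for p in pred_boxes) for g in gold_boxes)
```

Category recognition, risk assessment, and necessity judgment would additionally report per-class confusion matrices over the annotated labels.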

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater rigor and clarity.

Point-by-point responses
  1. Referee: [Dataset Construction and Annotation] Dataset annotation section: No inter-annotator agreement statistics, multiple-annotator protocol, or external validation against privacy experts are reported for the subjective labels (risk levels and task necessity). These labels directly underpin the headline claims about model struggles with risk assessment and task-necessity judgment; without agreement metrics, systematic annotator bias cannot be ruled out as an alternative explanation for the observed performance gaps.

    Authors: We agree that inter-annotator agreement metrics are essential for subjective labels. The revised manuscript will include a detailed description of the multiple-annotator protocol (three annotators per label with disagreement resolution via discussion) and report agreement statistics such as Fleiss' kappa for risk levels and task necessity. We will also acknowledge that external validation by privacy experts was not performed and discuss this as a limitation. revision: yes

  2. Referee: [Evaluation Results] Evaluation results: The claims that models 'can often detect' private information yet 'struggle' with localization, categories, risk assessment, and task-necessity judgment are presented without accompanying quantitative metrics (precision/recall, accuracy, or confusion matrices), error analysis, or the full evaluation protocol. This absence prevents assessment of effect sizes and reproducibility of the reported bottlenecks.

    Authors: We agree that the evaluation section would benefit from explicit quantitative support. The revised manuscript will add precision, recall, accuracy, confusion matrices, and a dedicated error analysis subsection for the privacy recognition tasks. The full evaluation protocol will be described in the main text (with additional details moved from the appendix) to ensure reproducibility. revision: yes

  3. Referee: [Offline Planning Fidelity] Planning fidelity evaluation: The consistency result for Claude Sonnet 4.6 after privacy protection is stated at a high level but lacks the concrete measurement protocol (e.g., semantic similarity metric, planner output comparison method, or control conditions) needed to substantiate that the planner semantics remain 'largely consistent.'

    Authors: We will expand the offline planning fidelity section in the revision to specify the concrete protocol. This will include the semantic similarity metric (embedding cosine similarity), the exact method for comparing planner outputs, and the control conditions used to support the consistency claim for Claude Sonnet 4.6 in Android environments. revision: yes
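A minimal sketch of the embedding-cosine fidelity check described in this response, with a toy bag-of-words encoder standing in for a real sentence-embedding model (the vocabulary and the 0.9 threshold are illustrative assumptions, not the paper's settings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def planner_consistent(embed, plan_raw, plan_protected, thresh=0.9):
    """Planner fidelity: embed the planner's output on the raw and on the
    privacy-protected screenshot, then compare by cosine similarity.
    `embed` is any sentence-embedding function."""
    return cosine(embed(plan_raw), embed(plan_protected)) >= thresh

# Toy bag-of-words embedding, standing in for a real sentence encoder.
VOCAB = ["open", "settings", "tap", "send", "message"]
def toy_embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]
```

A control condition would run the same comparison between two raw-screenshot plans, to separate protection-induced drift from the planner's own sampling variance.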

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct observations

Full rationale

The paper constructs GUIGuard-Bench from 241 trajectories and 4,080 human-annotated screenshots, then reports direct model evaluations on privacy detection, localization, and planning fidelity. No equations, fitted parameters, or predictions appear; results are observational comparisons against the annotations rather than derivations that reduce to inputs by construction. Self-citations are absent from load-bearing claims, and the benchmark is externally falsifiable via the released annotations and trajectories.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the creation and annotation of a new dataset plus empirical tests on existing models; no free parameters are introduced, and the only notable assumption is the accuracy of human annotations.

axioms (1)
  • domain assumption Human annotations for privacy bounding boxes, semantic categories, risk levels, and task necessity are accurate and unbiased.
    All three supported evaluations depend directly on these labels being reliable.
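This axiom is checkable: the Fleiss' kappa statistic the simulated rebuttal promises quantifies how far multiple annotators agree beyond chance. A minimal sketch:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for fixed-rater categorical labels.
    `ratings[i][j]` = number of annotators assigning item i to category j;
    every row must sum to the same rater count."""
    n = len(ratings)          # items (e.g. annotated privacy regions)
    r = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # categories (e.g. risk levels)
    # Observed per-item agreement, averaged over items.
    p_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings]
    p_bar = sum(p_i) / n
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

With three annotators per label, kappa near 1 would support the axiom; kappa near 0 would mean the benchmark's subjective labels are closer to chance agreement.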

pith-pipeline@v0.9.0 · 5604 in / 1392 out tokens · 51506 ms · 2026-05-16T11:28:23.522972+00:00 · methodology

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contrastive Privacy: A Semantic Approach to Measuring Privacy of AI-based Sanitization

    cs.CR 2026-05 unverdicted novelty 7.0

    Contrastive privacy is a new corpus-contrast test for semantic privacy in AI-sanitized media that uses latent concept measures and requires no manual labeling.

  2. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  3. Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.

  4. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 4 Pith papers · 4 internal anchors

  1. [1]

    Project Astra: A Universal Multimodal AI Assistant

    Google DeepMind. Project Astra: A Universal Multimodal AI Assistant. https://blog.google/technology/google-deepmind/gemini-universal-ai-assistant/. 2025

  2. [2]

    Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

    Mathieu Andreux et al. “Surfer 2: The Next Generation of Cross-Platform Computer Use Agents”. In:arXiv preprint arXiv:2510.19949(2025)

  3. [3]

    GUI Agents: A Survey

    Dang Nguyen et al. “GUI Agents: A Survey”. In:Findings of the Association for Computational Linguistics: ACL 2025. Ed. by Wanxiang Che et al. Vienna, Austria: Association for Computational Linguistics, July 2025, pp. 22522–22538. DOI: 10.18653/v1/2025.findings-acl.1158

  4. [4]

    GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

    Chiyu Chen et al. “GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?” In:arXiv preprint arXiv:2510.20333(2025)

  5. [5]

    Doubao Phone Assistant (Technical Preview)

    ByteDance (Doubao Team). Doubao Phone Assistant (Technical Preview). Product webpage. Launched Dec 1, 2025. Accessed Dec 19, 2025. Dec. 2025. URL: https://o.doubao.com/

  6. [6]

    Wuying AgentBay (AgentBay): All-scenario AI Agent Execution Platform

    Alibaba Cloud. Wuying AgentBay (AgentBay): All-scenario AI Agent Execution Platform. Product webpage. Accessed Dec 19, 2025. 2025. URL: https://www.aliyun.com/product/agentbay

  7. [7]

    Introducing ChatGPT Atlas

    OpenAI. Introducing ChatGPT Atlas. OpenAI product announcement. Oct. 2025. Accessed Dec 19, 2025. URL: https://openai.com/index/introducing-chatgpt-atlas/

  9. [9]

    GPT-4 as a source of patient information for anterior cervical discectomy and fusion: a comparative analysis against Google web search

    Paul G Mastrokostas et al. “GPT-4 as a source of patient information for anterior cervical discectomy and fusion: a comparative analysis against Google web search”. In:Global Spine Journal 14.8 (2024), pp. 2389–2398

  10. [10]

    Exploring the potential of large language models and generative artificial intelligence (GPT): Applications in Library and Information Science

    Matus Formanek. “Exploring the potential of large language models and generative artificial intelligence (GPT): Applications in Library and Information Science”. In:Journal of Librarianship and Information Science 57.2 (2025), pp. 568–590

  11. [11]

    Large language models empowered personalized web agents

    Hongru Cai et al. “Large language models empowered personalized web agents”. In:Proceedings of the ACM on Web Conference 2025. 2025, pp. 198–215

  12. [12]

    Bearcubs: A benchmark for computer-using web agents

    Yixiao Song et al. “Bearcubs: A benchmark for computer-using web agents”. In:arXiv preprint arXiv:2503.07919(2025)

  13. [13]

    A survey of webagents: Towards next-generation ai agents for web automation with large foundation models

    Liangbo Ning et al. “A survey of webagents: Towards next-generation ai agents for web automation with large foundation models”. In:Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 2025, pp. 6140–6150

  14. [14]

    Websight: A vision-first architecture for robust web agents

    Tanvir Bhathal and Asanshay Gupta. “Websight: A vision-first architecture for robust web agents”. In:arXiv preprint arXiv:2508.16987(2025)

  15. [15]

    Gui testing arena: A unified benchmark for advancing autonomous gui testing agent

    Kangjia Zhao et al. “Gui testing arena: A unified benchmark for advancing autonomous gui testing agent”. In:arXiv preprint arXiv:2412.18426(2024)

  16. [16]

    Towards trustworthy gui agents: A survey

    Yucheng Shi et al. “Towards trustworthy gui agents: A survey”. In:arXiv preprint arXiv:2503.23434(2025)

  17. [17]

    Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications

    Chaoran Chen et al. “Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications”. In:Proceedings of the 30th International Conference on Intelligent User Interfaces. 2025, pp. 277–297

  18. [18]

    Large Language Model-Brained GUI Agents: A Survey

    Chaoyun Zhang et al. “Large Language Model-Brained GUI Agents: A Survey”. In:Transactions on Machine Learning Research 2025.1 (2025)

  19. [19]

    Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models

    Pete Janowczyk et al. “Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models”. In:arXiv preprint arXiv:2411.05056(2024)

  20. [20]

    EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

    Zeyi Liao et al. “EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage”. In:Proceedings of the International Conference on Representation Learning (ICLR) 2025. 2025

  21. [21]

    When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs

    Hanna Kim et al. “When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs”. In:34th USENIX Security Symposium (USENIX Security 25). 2025, pp. 1729–1748

  22. [22]

    Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

    Yuyang Wanyan et al. “Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation”. In:arXiv preprint arXiv:2506.04614(2025)

  23. [23]

    Anthropic.Agents and Tools: Computer Use. Online. Accessed: March 16, 2025. 2025

  24. [25]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang et al. “Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning”. In:arXiv preprint arXiv:2509.02544(2025)

  25. [26]

    Privacyasst: Safeguarding user privacy in tool-using large language model agents

    Xinyu Zhang et al. “Privacyasst: Safeguarding user privacy in tool-using large language model agents”. In:IEEE Transactions on Dependable and Secure Computing 21.6 (2024), pp. 5242–5258

  26. [27]

    Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning

    Zhen Xiang et al. “Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning”. In:arXiv preprint arXiv:2406.09187(2024)

  27. [28]

    OpenAI.Computer-using Agent. Online. Accessed: March 16, 2025. 2025

  28. [29]

    Adversaflow: Visual red teaming for large language models with multi-level adversarial flow

    Dazhen Deng et al. “Adversaflow: Visual red teaming for large language models with multi-level adversarial flow”. In:IEEE Transactions on Visualization and Computer Graphics (2024)

  29. [30]

    Autodroid: Llm-powered task automation in android

    Hao Wen et al. “Autodroid: Llm-powered task automation in android”. In:Proceedings of the 30th Annual International Conference on Mobile Computing and Networking. 2024, pp. 543–557

  30. [31]

    Mla-trust: Benchmarking trustworthiness of multimodal llm agents in gui environments

    Xiao Yang et al. “Mla-trust: Benchmarking trustworthiness of multimodal llm agents in gui environments”. In:arXiv preprint arXiv:2506.01616(2025)

  31. [32]

    Caution for the environment: Multimodal agents are susceptible to environmental distractions

    Xinbei Ma et al. “Caution for the environment: Multimodal agents are susceptible to environmental distractions”. In:arXiv preprint arXiv:2408.02544(2024)

  32. [33]

    OpenAI.GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum. Tech. rep. OpenAI, 2025. URL: https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf

  33. [34]

    Gemini 3 Developer Guide

    Google DeepMind. Gemini 3 Developer Guide. https://ai.google.dev/gemini-api/docs/gemini-3. 2025

  34. [35]

    Anthropic. Claude 4.5 Models. https://platform.claude.com/docs/models. 2025

  35. [36]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano et al. “WebGPT: Browser-assisted question-answering with human feedback”. In:arXiv preprint arXiv:2112.09332(2021)

  36. [37]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao et al. “ReAct: Synergizing Reasoning and Acting in Language Models”. In: International Conference on Learning Representations. 2023

  37. [38]

    Multimodal Web Navigation with Instruction-Finetuned Foundation Models

    Hiroki Furuta et al. “Multimodal Web Navigation with Instruction-Finetuned Foundation Models”. In:International Conference on Learning Representations. 2024

  38. [39]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Boyuan Zheng et al. “GPT-4V(ision) is a Generalist Web Agent, if Grounded”. In:Proceedings of the 41st International Conference on Machine Learning. Ed. by Ruslan Salakhutdinov et al. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024, pp. 61349–61385

  39. [40]

    AppAgent: Multimodal Agents as Smartphone Users

    Chi Zhang et al. “AppAgent: Multimodal Agents as Smartphone Users”. In:arXiv preprint arXiv:2312.13771(2023)

  40. [41]

    AppAgent v2: Advanced Agent for Flexible Mobile Interactions

    Yanda Li et al. “AppAgent v2: Advanced Agent for Flexible Mobile Interactions”. In:arXiv preprint arXiv:2408.11824(2024)

  41. [42]

    From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

    Peter Shaw et al. “From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces”. In:Advances in Neural Information Processing Systems. Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 34354–34370

  42. [43]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng et al. “SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents”. In:Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 9313–9332.DOI:10.18653/v1/2024.acl-long.505

  43. [44]

    ScreenAgent: A Vision Language Model-driven Computer Control Agent

    Runliang Niu et al. “ScreenAgent: A Vision Language Model-driven Computer Control Agent”. In:Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24). IJCAI, 2024, pp. 6433–6441.DOI:10.24963/ijcai.2024/711

  44. [45]

    OmniParser for Pure Vision Based GUI Agent

    Yadong Lu et al. “OmniParser for Pure Vision Based GUI Agent”. In:arXiv preprint arXiv:2408.00203(2024)

  45. [46]

    Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

    Keen You et al. “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs”. In: Computer Vision – ECCV 2024. Vol. 15122. Lecture Notes in Computer Science. Springer Nature Switzerland AG, 2024, pp. 240–255. DOI: 10.1007/978-3-031-73039-9_14

  46. [47]

    Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

    Zhangheng Li et al. “Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms”. In:Proceedings of the International Conference on Learning Representations (ICLR) 2025. 2025

  47. [48]

    Introducing Operator

    OpenAI. Introducing Operator. https://openai.com/zh-Hans-CN/index/introducing-operator/. Accessed: 2025-11-23. 2024

  48. [49]

    Computer-Using Agent

    OpenAI. Computer-Using Agent. https://openai.com/zh-Hans-CN/index/computer-using-agent/. Accessed: 2025-11-24. 2024

  49. [50]

    Developing Computer Use

    Anthropic. Developing Computer Use. https://www.anthropic.com/news/developing-computer-use. Accessed: 2025-11-24. 2024

  50. [51]

    Introducing the Gemini 2.5 Computer Use model

    Google DeepMind. Introducing the Gemini 2.5 Computer Use model. https://blog.google/technology/google-deepmind/gemini-computer-use-model/. Accessed: 2025-11-24. 2025

  51. [54]

    OpenCUA: Open Foundations for Computer-Use Agents

    Xinyuan Wang et al. “OpenCUA: Open Foundations for Computer-Use Agents”. In:arXiv preprint arXiv:2508.09123(2025)

  52. [55]

    EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

    Zeyi Liao et al. “EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage”. In:Proceedings of the International Conference on Learning Representations (ICLR) 2025. Poster. 2025

  53. [56]

    Imprompter: Tricking LLM Agents into Improper Tool Use

    Xiaohan Fu et al. “Imprompter: Tricking LLM Agents into Improper Tool Use”. In:arXiv preprint arXiv:2410.14923(2024)

  54. [57]

    The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections

    Chaoran Chen et al. “The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections”. In:arXiv preprint arXiv:2504.11281(2025)

  55. [58]

    Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution

    Meysam Alizadeh et al. “Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution”. In:arXiv preprint arXiv:2506.01055(2025)

  56. [59]

    Unveiling Privacy Risks in LLM Agent Memory

    Bo Wang et al. “Unveiling Privacy Risks in LLM Agent Memory”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, 2025, pp. 25241–25260. DOI:10.18653/v1/2025.acl-long.1227

  57. [60]

    Private Attribute Inference from Images with Vision-Language Models

    Batuhan Tömekçe et al. “Private Attribute Inference from Images with Vision-Language Models”. In:Advances in Neural Information Processing Systems 37. 2024, pp. 103619–103651. DOI: 10.52202/079017-3291

  58. [61]

    Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Models

    Weidi Luo et al. “Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Models”. In:arXiv preprint arXiv:2504.19373 (2025)

  59. [62]

    Human-Centered Privacy Research in the Age of Large Language Models

    Tianshi Li et al. “Human-Centered Privacy Research in the Age of Large Language Models”. In:Extended Abstracts of the ACM Conference on Human Factors in Computing Systems (CHI) 2024. 2024, Paper No. 59.DOI:10.1145/3613905.3643983

  60. [63]

    Can Humans Oversee Agents to Prevent Privacy Leakage? A Study on Privacy Awareness, Preferences, and Trust in Language Model Agents

    Zhiping Zhang, Bingcan Guo, and Tianshi Li. “Can Humans Oversee Agents to Prevent Privacy Leakage? A Study on Privacy Awareness, Preferences, and Trust in Language Model Agents”. In:arXiv preprint arXiv:2411.01344 (2024)

  61. [64]

    Apple Support

    Apple Inc.About iCloud Private Relay. Apple Support. Accessed: 2025-12-04. 2023.URL: https://support.apple.com/en-sg/102602

  62. [65]

    Victor Costan and Srinivas Devadas.Intel SGX Explained. Tech. rep. 2016/086. IACR Cryptology ePrint Archive, 2016.URL:https://eprint.iacr.org/2016/086.pdf

  63. [66]

    Attestation Mechanisms for Trusted Execution Environments Demystified

    Jämes Ménétrey et al. “Attestation Mechanisms for Trusted Execution Environments Demystified”. In:Proceedings of the 2022 Workshop on System Software for Trusted Execution (SysTEX). 2022. DOI: 10.1007/978-3-031-16092-9_7

  64. [67]

    Trusted Mobile Computing: An Overview of Existing Solutions

    M. Amine Bouazzouni, Emmanuel Conchon, and Fabrice Peyrard. “Trusted Mobile Computing: An Overview of Existing Solutions”. In: Future Generation Computer Systems 80 (2018), pp. 596–612. DOI: 10.1016/j.future.2017.05.029

  65. [68]

    PrivacyAsst: Safeguarding User Privacy in Tool-Using Large Language Model Agents

    Xinyu Zhang et al. “PrivacyAsst: Safeguarding User Privacy in Tool-Using Large Language Model Agents”. In: IEEE Transactions on Dependable and Secure Computing 21.6 (2024), pp. 5242–5258. DOI: 10.1109/TDSC.2024.3372777

  66. [69]

    MMPro: A Decoupled Perception-Thinking-Execution Framework for Secure GUI Agent

    Benlong Wu et al. “MMPro: A Decoupled Perception-Thinking-Execution Framework for Secure GUI Agent”. In: Proceedings of the 33rd ACM International Conference on Multimedia (MM 2025). 2025, pp. 4679–4688. DOI: 10.1145/3746027.3755553

  67. [70]

    tanaos-text-anonymizer-v1: A small but performant Text Anonymization model

    Tanaos. tanaos-text-anonymizer-v1: A small but performant Text Anonymization model. Hugging Face Model Card. Accessed: 2025-12-05. 2025. URL: https://huggingface.co/tanaos/tanaos-text-anonymizer-v1

  68. [71]

    GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning

    Zhen Xiang et al. “GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning”. In: Proceedings of the 42nd International Conference on Machine Learning (ICML 2025). 2025

  69. [72]

    AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

    Weidi Luo et al. “AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection”. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. 2025, pp. 8104–8139. DOI: 10.18653/v1/2025.acl-long.399

  70. [73]

    Towards a visual privacy advisor: Understanding and predicting privacy risks in images

    Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. “Towards a visual privacy advisor: Understanding and predicting privacy risks in images”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3686–3695

  71. [74]

    Privacyalert: A dataset for image privacy prediction

    Chenye Zhao et al. “Privacyalert: A dataset for image privacy prediction”. In: Proceedings of the International AAAI Conference on Web and Social Media. Vol. 16. 2022, pp. 1352–1361

  72. [75]

    Evaluation of Human Visual Privacy Protection: Three-Dimensional Framework and Benchmark Dataset

    Sara Abdulaziz, Giacomo D’amicantonio, and Egor Bondarev. “Evaluation of Human Visual Privacy Protection: Three-Dimensional Framework and Benchmark Dataset”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025, pp. 5893–5902

  73. [76]

    Biv-priv-seg: Locating private content in images taken by people with visual impairments

    Yu-Yun Tseng et al. “Biv-priv-seg: Locating private content in images taken by people with visual impairments”. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. 2025, pp. 430–440

  74. [77]

    DIPA2: An Image Dataset with Cross-cultural Privacy Perception Annotations

    Anran Xu et al. “DIPA2: An Image Dataset with Cross-cultural Privacy Perception Annotations”. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7.4 (2024), pp. 1–30

  75. [78]

    Multi-P²A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

    Jie Zhang et al. “Multi-P²A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models”. In: arXiv preprint arXiv:2412.19496 (2024)

  76. [79]

    ScreenSpot

    RootsAutomation. ScreenSpot. https://huggingface.co/datasets/rootsautomation/ScreenSpot. Accessed: 2025-11-26

  77. [80]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li et al. “Screenspot-pro: Gui grounding for professional high-resolution computer use”. In: Proceedings of the 33rd ACM International Conference on Multimedia. 2025, pp. 8778–8786

  78. [81]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh et al. “Visualwebarena: Evaluating multimodal agents on realistic visual web tasks”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, pp. 881–905

  79. [82]

    GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    Quanfeng Lu et al. “GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025, pp. 22404–22414

  80. [83]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie et al. “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments”. In: Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Datasets & Benchmarks Track. 2024, pp. 52040–52094. DOI: 10.52202/079017-1650
