pith. machine review for the scientific record.

arXiv: 2601.18842 · v3 · submitted 2026-01-26 · 💻 cs.CR · cs.AI · cs.CV

Recognition: no theorem link

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:28 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CV
keywords GUI agents · privacy preservation · screenshot privacy · benchmark evaluation · risk assessment · trajectory workflows · Android environments · PC interfaces

The pith

GUIGuard-Bench shows current models detect private information in GUI screenshots but struggle with precise localization, category recognition, risk assessment, and judging task necessity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GUIGuard-Bench, a dataset of 241 real GUI-agent trajectories containing 4,080 screenshots from Android and PC settings. Each screenshot carries region-level labels for privacy bounding boxes, semantic categories, risk levels, and whether the information is required to finish the task. The benchmark supports three evaluations: how well models recognize privacy elements, whether planners stay consistent when screenshots are protected, and how different protection methods affect task success. Results indicate models commonly notice the presence of private data yet perform poorly on the finer judgments needed for safe operation. This matters because GUI agents that process raw screenshots risk exposing user identities, accounts, and behavior unless they can handle these distinctions reliably.

Core claim

GUIGuard-Bench supplies 241 trajectory-based GUI workflows with 4,080 screenshots annotated at the region level for privacy bounding boxes, categories, risk levels, and task necessity. It measures privacy recognition accuracy, offline planner fidelity after protection is applied to screenshots, and the utility cost of protection strategies. The evaluation finds that models can usually identify whether a screenshot contains private information, yet they falter on fine-grained localization, category recognition, risk assessment, and determining whether the private element is required for the task. Closed-source models maintain largely consistent planner semantics in Android environments once privacy protection is applied.
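The annotation structure this claim rests on (per-region bounding boxes, categories, risk levels, and task necessity) can be pictured as a minimal record type. The field names and values below are illustrative, not the benchmark's released schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PrivacyRegion:
    """One region-level label on a screenshot (illustrative fields)."""
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    category: str                    # semantic privacy category, e.g. "account"
    risk: str                        # "high" | "medium" | "low" | "none"
    task_necessary: bool             # is this element required for the task?

@dataclass
class Screenshot:
    image_path: str
    regions: List[PrivacyRegion] = field(default_factory=list)

@dataclass
class Trajectory:
    platform: str  # "android" | "pc"
    task: str
    screenshots: List[Screenshot] = field(default_factory=list)

# One step of a hypothetical Android trajectory:
traj = Trajectory(
    platform="android",
    task="transfer money to a saved contact",
    screenshots=[Screenshot(
        image_path="step_03.png",
        regions=[PrivacyRegion((120, 40, 480, 90), "account", "high", True)],
    )],
)
```

Each of the three evaluations then reads off a different field: recognition against the boxes and categories, risk assessment against the risk label, and necessity judgment against the task-necessity flag.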

What carries the argument

The GUIGuard-Bench dataset of trajectory screenshots carrying region-level annotations for privacy elements, risk, and task necessity.

Load-bearing premise

The human-provided region-level annotations for privacy bounding boxes, categories, risk levels, and task necessity accurately capture real-world GUI privacy risks across the collected trajectories.

What would settle it

A new collection of GUI trajectories whose privacy regions and necessity judgments are independently verified by multiple annotators or by observing actual data leaks in controlled agent runs, then re-testing the same models on localization and necessity accuracy.

Figures

Figures reproduced from arXiv: 2601.18842 by Jie Zhang, Jiyan He, Qiannan Zhu, Shuxin Zheng, Weiming Zhang, Wenbo Zhou, Yanxi Wang, Yu Shi, Zhiling Zhang.

Figure 1. Overview of the GUIGuard Framework and Benchmark. The top section shows GUIGuard's …
Figure 2. (a) OSWorld benchmark results (accuracy %), comparing representative closed-source …
Figure 3. The dataset structure is illustrated in the figure. It consists of 240 trajectories (4,080 …
Figure 4. Privacy recognition results on GUIGuard-Bench for both PC (blue) and Android (red) …
Figure 5. Fine-grained privacy label recognition on GUIGuard-Bench. For private elements that pass …
Figure 6. Privacy Protection for GUI Screenshots at Two Levels. Risk levels: red = high risk, yellow = medium risk, green = low risk, gray = no risk. Pixel-level masking (Mask): redact detected private regions using an opaque rectangular mask (blackout or background-color). Semantic-level replacement (Replace): anonymize sensitive regions via (i) LLM-based text replacement (extract and rewrite private text, then re-…
Figure 7. Task execution and fidelity evaluation framework of GUIGuard-Bench. The grounding …
Figure 8. (a) Task execution success rates on MobileWorld for the case agent with Gemini 3 as the …
Figure 9. The workflow alternates between the image generation model and the GUI agent: the model …
Figure 10. Task distribution of the GUI agent in (a) real PC and mobile environments and (b) …
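The pixel-level branch of the protection scheme in Figure 6 amounts to overwriting detected private regions before the screenshot reaches the model. A minimal sketch, with plain Python lists standing in for a real image buffer:

```python
def mask_region(image, bbox, fill=(0, 0, 0)):
    """Pixel-level masking: overwrite every pixel inside
    bbox = (x_min, y_min, x_max, y_max) with a flat color.
    `image` is a mutable H x W grid of (r, g, b) tuples."""
    x_min, y_min, x_max, y_max = bbox
    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            image[y][x] = fill
    return image

# 4x4 white "screenshot"; blackout a detected private region.
img = [[(255, 255, 255)] * 4 for _ in range(4)]
mask_region(img, (0, 0, 2, 2))
```

The semantic-level replacement path (rewriting private text via an LLM and re-rendering it) needs a generative model and is not sketched here.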
Original abstract

As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, locations, and behavioral traces. While existing benchmarks primarily focus on task completion, grounding, or defenses against third-party attacks, current visual privacy datasets remain largely restricted to static natural images, limiting their ability to capture the contextual dependence and task relevance of privacy risks in GUI task trajectories. To bridge this gap, we introduce GUIGuard-Bench, a first-step benchmark for studying privacy-preserving GUI agents in trajectory-based GUI workflows. GUIGuard-Bench contains 241 real GUI-agent trajectories with 4,080 screenshots across Android and PC environments. Each screenshot is annotated at the region level with privacy bounding boxes, semantic privacy categories, risk levels, and whether the private information is necessary for completing the task. Built on these annotations, GUIGuard-Bench supports three complementary evaluations: privacy recognition, offline planning fidelity under protected screenshots, and the utility impact of different protection strategies. Our results show that current models can often detect whether a screenshot contains private information, but they struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment. We also find that closed-source models, exemplified by Claude Sonnet 4.6, can maintain largely consistent planner semantics in Android environments after privacy protection is applied. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GUIGuard-Bench, a benchmark with 241 real GUI-agent trajectories and 4,080 screenshots from Android and PC environments. Each screenshot receives region-level annotations for privacy bounding boxes, semantic categories, risk levels, and task necessity. The benchmark supports three evaluations: privacy recognition by models, offline planning fidelity on protected screenshots, and utility impact of protection strategies. Key findings are that current models often detect private information presence but struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment, while closed-source models such as Claude Sonnet 4.6 maintain largely consistent planner semantics in Android environments after privacy protection is applied.

Significance. If the human annotations prove reliable, the benchmark addresses a clear gap between static-image privacy datasets and the contextual, trajectory-based privacy risks faced by GUI agents. The dual focus on recognition failures and downstream planning consistency provides actionable diagnostics for privacy-preserving agent design. The open release of trajectories and annotations could enable reproducible follow-up work on protection mechanisms.

major comments (3)
  1. [Dataset Construction and Annotation] Dataset annotation section: No inter-annotator agreement statistics, multiple-annotator protocol, or external validation against privacy experts are reported for the subjective labels (risk levels and task necessity). These labels directly underpin the headline claims about model struggles with risk assessment and task-necessity judgment; without agreement metrics, systematic annotator bias cannot be ruled out as an alternative explanation for the observed performance gaps.
  2. [Evaluation Results] Evaluation results: The claims that models 'can often detect' private information yet 'struggle' with localization, categories, risk assessment, and task-necessity judgment are presented without accompanying quantitative metrics (precision/recall, accuracy, or confusion matrices), error analysis, or the full evaluation protocol. This absence prevents assessment of effect sizes and reproducibility of the reported bottlenecks.
  3. [Offline Planning Fidelity] Planning fidelity evaluation: The consistency result for Claude Sonnet 4.6 after privacy protection is stated at a high level but lacks the concrete measurement protocol (e.g., semantic similarity metric, planner output comparison method, or control conditions) needed to substantiate that the planner semantics remain 'largely consistent.'
minor comments (2)
  1. [Abstract] Abstract: The summary of results would be strengthened by including at least one or two key quantitative figures (e.g., detection accuracy or consistency score) rather than purely qualitative statements.
  2. [Figures] Figure and table captions: Ensure all figures showing annotation examples or model outputs include explicit scale bars, color legends, and sample sizes so readers can interpret them without returning to the main text.
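The quantitative evidence major comment 2 asks for would come from standard metrics. A minimal sketch of presence-level precision/recall and IoU-matched localization (the 0.5 IoU threshold is a conventional choice, not the paper's stated protocol):

```python
def precision_recall(pred, gold):
    """Presence-level detection: pred/gold are per-screenshot booleans
    for 'contains private information'."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def localization_hits(pred_boxes, gold_boxes, thresh=0.5):
    """Count gold privacy regions matched by some prediction at IoU >= thresh."""
    return sum(any(iou(p, g) >= thresh for p in pred_boxes) for g in gold_boxes)
```

Category recognition, risk assessment, and necessity judgment would additionally report per-class confusion matrices over the annotated labels.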

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater rigor and clarity.

Point-by-point responses
  1. Referee: [Dataset Construction and Annotation] Dataset annotation section: No inter-annotator agreement statistics, multiple-annotator protocol, or external validation against privacy experts are reported for the subjective labels (risk levels and task necessity). These labels directly underpin the headline claims about model struggles with risk assessment and task-necessity judgment; without agreement metrics, systematic annotator bias cannot be ruled out as an alternative explanation for the observed performance gaps.

    Authors: We agree that inter-annotator agreement metrics are essential for subjective labels. The revised manuscript will include a detailed description of the multiple-annotator protocol (three annotators per label with disagreement resolution via discussion) and report agreement statistics such as Fleiss' kappa for risk levels and task necessity. We will also acknowledge that external validation by privacy experts was not performed and discuss this as a limitation. revision: yes

  2. Referee: [Evaluation Results] Evaluation results: The claims that models 'can often detect' private information yet 'struggle' with localization, categories, risk assessment, and task-necessity judgment are presented without accompanying quantitative metrics (precision/recall, accuracy, or confusion matrices), error analysis, or the full evaluation protocol. This absence prevents assessment of effect sizes and reproducibility of the reported bottlenecks.

    Authors: We agree that the evaluation section would benefit from explicit quantitative support. The revised manuscript will add precision, recall, accuracy, confusion matrices, and a dedicated error analysis subsection for the privacy recognition tasks. The full evaluation protocol will be described in the main text (with additional details moved from the appendix) to ensure reproducibility. revision: yes

  3. Referee: [Offline Planning Fidelity] Planning fidelity evaluation: The consistency result for Claude Sonnet 4.6 after privacy protection is stated at a high level but lacks the concrete measurement protocol (e.g., semantic similarity metric, planner output comparison method, or control conditions) needed to substantiate that the planner semantics remain 'largely consistent.'

    Authors: We will expand the offline planning fidelity section in the revision to specify the concrete protocol. This will include the semantic similarity metric (embedding cosine similarity), the exact method for comparing planner outputs, and the control conditions used to support the consistency claim for Claude Sonnet 4.6 in Android environments. revision: yes
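A minimal sketch of the embedding-cosine fidelity check described in this response, with a toy bag-of-words encoder standing in for a real sentence-embedding model (the vocabulary and the 0.9 threshold are illustrative assumptions, not the paper's settings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def planner_consistent(embed, plan_raw, plan_protected, thresh=0.9):
    """Planner fidelity: embed the planner's output on the raw and on the
    privacy-protected screenshot, then compare by cosine similarity.
    `embed` is any sentence-embedding function."""
    return cosine(embed(plan_raw), embed(plan_protected)) >= thresh

# Toy bag-of-words embedding, standing in for a real sentence encoder.
VOCAB = ["open", "settings", "tap", "send", "message"]
def toy_embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]
```

A control condition would run the same comparison between two raw-screenshot plans, to separate protection-induced drift from the planner's own sampling variance.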

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct observations

Full rationale

The paper constructs GUIGuard-Bench from 241 trajectories and 4,080 human-annotated screenshots, then reports direct model evaluations on privacy detection, localization, and planning fidelity. No equations, fitted parameters, or predictions appear; results are observational comparisons against the annotations rather than derivations that reduce to inputs by construction. Self-citations are absent from load-bearing claims, and the benchmark is externally falsifiable via the released annotations and trajectories.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the creation and annotation of a new dataset plus empirical tests on existing models; no free parameters are introduced, and the only notable assumption is the accuracy of human annotations.

axioms (1)
  • domain assumption Human annotations for privacy bounding boxes, semantic categories, risk levels, and task necessity are accurate and unbiased.
    All three supported evaluations depend directly on these labels being reliable.
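This axiom is checkable: the Fleiss' kappa statistic the simulated rebuttal promises quantifies how far multiple annotators agree beyond chance. A minimal sketch:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for fixed-rater categorical labels.
    `ratings[i][j]` = number of annotators assigning item i to category j;
    every row must sum to the same rater count."""
    n = len(ratings)          # items (e.g. annotated privacy regions)
    r = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # categories (e.g. risk levels)
    # Observed per-item agreement, averaged over items.
    p_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings]
    p_bar = sum(p_i) / n
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

With three annotators per label, kappa near 1 would support the axiom; kappa near 0 would mean the benchmark's subjective labels are closer to chance agreement.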

pith-pipeline@v0.9.0 · 5604 in / 1392 out tokens · 51506 ms · 2026-05-16T11:28:23.522972+00:00 · methodology

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contrastive Privacy: A Semantic Approach to Measuring Privacy of AI-based Sanitization

    cs.CR 2026-05 unverdicted novelty 7.0

    Contrastive privacy is a new corpus-contrast test for semantic privacy in AI-sanitized media that uses latent concept measures and requires no manual labeling.

  2. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  3. Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.

  4. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 4 Pith papers · 4 internal anchors

  1. [1]

    Project Astra: A Universal Multimodal AI Assistant

    Google DeepMind. Project Astra: A Universal Multimodal AI Assistant. https://blog.google/technology/google-deepmind/gemini-universal-ai-assistant/. 2025

  2. [2]

    Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

    Mathieu Andreux et al. “Surfer 2: The Next Generation of Cross-Platform Computer Use Agents”. In:arXiv preprint arXiv:2510.19949(2025)

  3. [3]

    GUI Agents: A Survey

    Dang Nguyen et al. “GUI Agents: A Survey”. In:Findings of the Association for Computational Linguistics: ACL 2025. Ed. by Wanxiang Che et al. Vienna, Austria: Association for Computational Linguistics, July 2025, pp. 22522–22538. DOI: 10.18653/v1/2025.findings-acl.1158

  4. [4]

    GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

    Chiyu Chen et al. “GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?” In:arXiv preprint arXiv:2510.20333(2025)

  5. [5]

    Doubao Phone Assistant (Technical Preview)

    ByteDance (Doubao Team). Doubao Phone Assistant (Technical Preview). Product webpage. Launched Dec 1, 2025. Accessed Dec 19, 2025. Dec. 2025. URL: https://o.doubao.com/

  6. [6]

    Wuying AgentBay (AgentBay): All-scenario AI Agent Execution Platform

    Alibaba Cloud. Wuying AgentBay (AgentBay): All-scenario AI Agent Execution Platform. Product webpage. Accessed Dec 19, 2025. 2025. URL: https://www.aliyun.com/product/agentbay

  7. [7]

    Introducing ChatGPT Atlas

    OpenAI. Introducing ChatGPT Atlas. OpenAI product announcement. Oct. 2025. Accessed Dec 19, 2025. URL: https://openai.com/index/introducing-chatgpt-atlas/

  9. [9]

    GPT-4 as a source of patient information for anterior cervical discectomy and fusion: a comparative analysis against Google web search

    Paul G Mastrokostas et al. “GPT-4 as a source of patient information for anterior cervical discectomy and fusion: a comparative analysis against Google web search”. In:Global Spine Journal 14.8 (2024), pp. 2389–2398

  10. [10]

    Exploring the potential of large language models and generative artificial intelligence (GPT): Applications in Library and Information Science

    Matus Formanek. “Exploring the potential of large language models and generative artificial intelligence (GPT): Applications in Library and Information Science”. In:Journal of Librarianship and Information Science 57.2 (2025), pp. 568–590

  11. [11]

    Large language models empowered personalized web agents

    Hongru Cai et al. “Large language models empowered personalized web agents”. In:Proceedings of the ACM on Web Conference 2025. 2025, pp. 198–215

  12. [12]

    Bearcubs: A benchmark for computer-using web agents

    Yixiao Song et al. “Bearcubs: A benchmark for computer-using web agents”. In:arXiv preprint arXiv:2503.07919(2025)

  13. [13]

    A survey of webagents: Towards next-generation ai agents for web automation with large foundation models

    Liangbo Ning et al. “A survey of webagents: Towards next-generation ai agents for web automation with large foundation models”. In:Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 2025, pp. 6140–6150

  14. [14]

    Websight: A vision-first architecture for robust web agents

    Tanvir Bhathal and Asanshay Gupta. “Websight: A vision-first architecture for robust web agents”. In:arXiv preprint arXiv:2508.16987(2025)

  15. [15]

    Gui testing arena: A unified benchmark for advancing autonomous gui testing agent

    Kangjia Zhao et al. “Gui testing arena: A unified benchmark for advancing autonomous gui testing agent”. In:arXiv preprint arXiv:2412.18426(2024)

  16. [16]

    Towards trustworthy gui agents: A survey

    Yucheng Shi et al. “Towards trustworthy gui agents: A survey”. In:arXiv preprint arXiv:2503.23434(2025)

  17. [17]

    Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications

    Chaoran Chen et al. “Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications”. In:Proceedings of the 30th International Conference on Intelligent User Interfaces. 2025, pp. 277–297

  18. [18]

    Large Language Model-Brained GUI Agents: A Survey

    Chaoyun Zhang et al. “Large Language Model-Brained GUI Agents: A Survey”. In:Transactions on Machine Learning Research 2025.1 (2025)

  19. [19]

    Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models

    Pete Janowczyk et al. “Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models”. In:arXiv preprint arXiv:2411.05056(2024)

  20. [20]

    EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

    Zeyi Liao et al. “EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage”. In:Proceedings of the International Conference on Representation Learning (ICLR) 2025. 2025

  21. [21]

    When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs

    Hanna Kim et al. “When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs”. In:34th USENIX Security Symposium (USENIX Security 25). 2025, pp. 1729–1748

  22. [22]

    Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

    Yuyang Wanyan et al. “Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation”. In:arXiv preprint arXiv:2506.04614(2025)

  23. [23]

    Anthropic.Agents and Tools: Computer Use. Online. Accessed: March 16, 2025. 2025

  24. [25]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang et al. “Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning”. In:arXiv preprint arXiv:2509.02544(2025)

  25. [26]

    Privacyasst: Safeguarding user privacy in tool-using large language model agents

    Xinyu Zhang et al. “Privacyasst: Safeguarding user privacy in tool-using large language model agents”. In:IEEE Transactions on Dependable and Secure Computing 21.6 (2024), pp. 5242–5258

  26. [27]

    Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning

    Zhen Xiang et al. “Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning”. In:arXiv preprint arXiv:2406.09187(2024)

  27. [28]

    OpenAI.Computer-using Agent. Online. Accessed: March 16, 2025. 2025

  28. [29]

    Adversaflow: Visual red teaming for large language models with multi-level adversarial flow

    Dazhen Deng et al. “Adversaflow: Visual red teaming for large language models with multi-level adversarial flow”. In:IEEE Transactions on Visualization and Computer Graphics (2024)

  29. [30]

    Autodroid: Llm-powered task automation in android

    Hao Wen et al. “Autodroid: Llm-powered task automation in android”. In:Proceedings of the 30th Annual International Conference on Mobile Computing and Networking. 2024, pp. 543–557

  30. [31]

    Mla-trust: Benchmarking trustworthiness of multimodal llm agents in gui environments

    Xiao Yang et al. “Mla-trust: Benchmarking trustworthiness of multimodal llm agents in gui environments”. In:arXiv preprint arXiv:2506.01616(2025)

  31. [32]

    Caution for the environment: Multimodal agents are susceptible to environmental distractions

    Xinbei Ma et al. “Caution for the environment: Multimodal agents are susceptible to environmental distractions”. In:arXiv preprint arXiv:2408.02544(2024)

  32. [33]

    OpenAI.GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum. Tech. rep. OpenAI, 2025. URL: https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf

  33. [34]

    Gemini 3 Developer Guide

    Google DeepMind. Gemini 3 Developer Guide. https://ai.google.dev/gemini-api/docs/gemini-3. 2025

  34. [35]

    Anthropic. Claude 4.5 Models. https://platform.claude.com/docs/models. 2025

  35. [36]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano et al. “WebGPT: Browser-assisted question-answering with human feedback”. In:arXiv preprint arXiv:2112.09332(2021)

  36. [37]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao et al. “ReAct: Synergizing Reasoning and Acting in Language Models”. In: International Conference on Learning Representations. 2023

  37. [38]

    Multimodal Web Navigation with Instruction-Finetuned Foundation Models

    Hiroki Furuta et al. “Multimodal Web Navigation with Instruction-Finetuned Foundation Models”. In:International Conference on Learning Representations. 2024

  38. [39]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Boyuan Zheng et al. “GPT-4V(ision) is a Generalist Web Agent, if Grounded”. In:Proceedings of the 41st International Conference on Machine Learning. Ed. by Ruslan Salakhutdinov et al. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024, pp. 61349–61385

  39. [40]

    AppAgent: Multimodal Agents as Smartphone Users

    Chi Zhang et al. “AppAgent: Multimodal Agents as Smartphone Users”. In:arXiv preprint arXiv:2312.13771(2023)

  40. [41]

    AppAgent v2: Advanced Agent for Flexible Mobile Interactions

    Yanda Li et al. “AppAgent v2: Advanced Agent for Flexible Mobile Interactions”. In:arXiv preprint arXiv:2408.11824(2024)

  41. [42]

    From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

    Peter Shaw et al. “From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces”. In:Advances in Neural Information Processing Systems. Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 34354–34370

  42. [43]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng et al. “SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents”. In:Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 9313–9332.DOI:10.18653/v1/2024.acl-long.505

  43. [44]

    ScreenAgent: A Vision Language Model-driven Computer Control Agent

    Runliang Niu et al. “ScreenAgent: A Vision Language Model-driven Computer Control Agent”. In:Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24). IJCAI, 2024, pp. 6433–6441.DOI:10.24963/ijcai.2024/711

  44. [45]

    OmniParser for Pure Vision Based GUI Agent

    Yadong Lu et al. “OmniParser for Pure Vision Based GUI Agent”. In:arXiv preprint arXiv:2408.00203(2024)

  45. [46]

    Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

    Keen You et al. “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs”. In: Computer Vision – ECCV 2024. Vol. 15122. Lecture Notes in Computer Science. Springer Nature Switzerland AG, 2024, pp. 240–255. DOI: 10.1007/978-3-031-73039-9_14

  46. [47]

    Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

    Zhangheng Li et al. “Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms”. In:Proceedings of the International Conference on Learning Representations (ICLR) 2025. 2025

  47. [48]

    Introducing Operator

    OpenAI. Introducing Operator. https://openai.com/zh-Hans-CN/index/introducing-operator/. Accessed: 2025-11-23. 2024

  48. [49]

    Computer-Using Agent

    OpenAI. Computer-Using Agent. https://openai.com/zh-Hans-CN/index/computer-using-agent/. Accessed: 2025-11-24. 2024

  49. [50]

    Developing Computer Use

    Anthropic. Developing Computer Use. https://www.anthropic.com/news/developing-computer-use. Accessed: 2025-11-24. 2024

  50. [51]

    Introducing the Gemini 2.5 Computer Use model

    Google DeepMind. Introducing the Gemini 2.5 Computer Use model. https://blog.google/technology/google-deepmind/gemini-computer-use-model/. Accessed: 2025-11-24. 2025

  51. [54]

    OpenCUA: Open Foundations for Computer-Use Agents

    Xinyuan Wang et al. “OpenCUA: Open Foundations for Computer-Use Agents”. In:arXiv preprint arXiv:2508.09123(2025)

  52. [55]

    EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

    Zeyi Liao et al. “EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage”. In:Proceedings of the International Conference on Learning Representations (ICLR) 2025. Poster. 2025

  53. [56]

    Imprompter: Tricking LLM Agents into Improper Tool Use

    Xiaohan Fu et al. “Imprompter: Tricking LLM Agents into Improper Tool Use”. In:arXiv preprint arXiv:2410.14923(2024)

  54. [57]

    The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections

    Chaoran Chen et al. “The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections”. In:arXiv preprint arXiv:2504.11281(2025)

  55. [58]

    Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution

    Meysam Alizadeh et al. “Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution”. In:arXiv preprint arXiv:2506.01055(2025)

  56. [59]

    Unveiling Privacy Risks in LLM Agent Memory

    Bo Wang et al. “Unveiling Privacy Risks in LLM Agent Memory”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, 2025, pp. 25241–25260. DOI:10.18653/v1/2025.acl-long.1227

  57. [60]

    Private Attribute Inference from Images with Vision-Language Models

    Batuhan Tömekçe et al. “Private Attribute Inference from Images with Vision-Language Models”. In:Advances in Neural Information Processing Systems 37. 2024, pp. 103619–103651. DOI: 10.52202/079017-3291

  58. [61]

    Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Models

    Weidi Luo et al. “Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Models”. In:arXiv preprint arXiv:2504.19373 (2025)

  59. [62]

    Human-Centered Privacy Research in the Age of Large Language Models

    Tianshi Li et al. “Human-Centered Privacy Research in the Age of Large Language Models”. In:Extended Abstracts of the ACM Conference on Human Factors in Computing Systems (CHI) 2024. 2024, Paper No. 59.DOI:10.1145/3613905.3643983

  60. [63]

    Can Humans Oversee Agents to Prevent Privacy Leakage? A Study on Privacy Awareness, Preferences, and Trust in Language Model Agents

    Zhiping Zhang, Bingcan Guo, and Tianshi Li. “Can Humans Oversee Agents to Prevent Privacy Leakage? A Study on Privacy Awareness, Preferences, and Trust in Language Model Agents”. In:arXiv preprint arXiv:2411.01344 (2024)

  61. [64]

    Apple Support

    Apple Inc.About iCloud Private Relay. Apple Support. Accessed: 2025-12-04. 2023.URL: https://support.apple.com/en-sg/102602

  62. [65]

    Victor Costan and Srinivas Devadas.Intel SGX Explained. Tech. rep. 2016/086. IACR Cryptology ePrint Archive, 2016.URL:https://eprint.iacr.org/2016/086.pdf

  63. [66]

    Attestation Mechanisms for Trusted Execution Environments Demystified

    Jämes Ménétrey et al. “Attestation Mechanisms for Trusted Execution Environments Demystified”. In:Proceedings of the 2022 Workshop on System Software for Trusted Execution (SysTEX). 2022. DOI: 10.1007/978-3-031-16092-9_7

  64. [67]

    Trusted Mobile Computing: An Overview of Existing Solutions

    M. Amine Bouazzouni, Emmanuel Conchon, and Fabrice Peyrard. “Trusted Mobile Computing: An Overview of Existing Solutions”. In: Future Generation Computer Systems 80 (2018), pp. 596–612. DOI: 10.1016/j.future.2017.05.029

  65. [68]

    PrivacyAsst: Safeguarding User Privacy in Tool-Using Large Language Model Agents

    Xinyu Zhang et al. “PrivacyAsst: Safeguarding User Privacy in Tool-Using Large Language Model Agents”. In: IEEE Transactions on Dependable and Secure Computing 21.6 (2024), pp. 5242–5258. DOI: 10.1109/TDSC.2024.3372777

  66. [69]

    MMPro: A Decoupled Perception-Thinking-Execution Framework for Secure GUI Agent

    Benlong Wu et al. “MMPro: A Decoupled Perception-Thinking-Execution Framework for Secure GUI Agent”. In: Proceedings of the 33rd ACM International Conference on Multimedia (MM 2025). 2025, pp. 4679–4688. DOI: 10.1145/3746027.3755553

  67. [70]

    tanaos-text-anonymizer-v1: A small but performant Text Anonymization model

    Tanaos. tanaos-text-anonymizer-v1: A small but performant Text Anonymization model. Hugging Face Model Card. Accessed: 2025-12-05. 2025. URL: https://huggingface.co/tanaos/tanaos-text-anonymizer-v1

  68. [71]

    GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning

    Zhen Xiang et al. “GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning”. In: Proceedings of the 42nd International Conference on Machine Learning (ICML 2025). 2025

  69. [72]

    AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

    Weidi Luo et al. “AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection”. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. 2025, pp. 8104–8139. DOI: 10.18653/v1/2025.acl-long.399

  70. [73]

    Towards a visual privacy advisor: Understanding and predicting privacy risks in images

    Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. “Towards a visual privacy advisor: Understanding and predicting privacy risks in images”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3686–3695

  71. [74]

    Privacyalert: A dataset for image privacy prediction

    Chenye Zhao et al. “Privacyalert: A dataset for image privacy prediction”. In: Proceedings of the International AAAI Conference on Web and Social Media. Vol. 16. 2022, pp. 1352–1361

  72. [75]

    Evaluation of Human Visual Privacy Protection: Three-Dimensional Framework and Benchmark Dataset

    Sara Abdulaziz, Giacomo D’amicantonio, and Egor Bondarev. “Evaluation of Human Visual Privacy Protection: Three-Dimensional Framework and Benchmark Dataset”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025, pp. 5893–5902

  73. [76]

    Biv-priv-seg: Locating private content in images taken by people with visual impairments

    Yu-Yun Tseng et al. “Biv-priv-seg: Locating private content in images taken by people with visual impairments”. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. 2025, pp. 430–440

  74. [77]

    DIPA2: An Image Dataset with Cross-cultural Privacy Perception Annotations

    Anran Xu et al. “DIPA2: An Image Dataset with Cross-cultural Privacy Perception Annotations”. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7.4 (2024), pp. 1–30

  75. [78]

    Multi-P²A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

    Jie Zhang et al. “Multi-P²A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models”. In: arXiv preprint arXiv:2412.19496 (2024)

  76. [79]

    ScreenSpot

    RootsAutomation. ScreenSpot. https://huggingface.co/datasets/rootsautomation/ScreenSpot. Accessed: 2025-11-26

  77. [80]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li et al. “Screenspot-pro: Gui grounding for professional high-resolution computer use”. In: Proceedings of the 33rd ACM International Conference on Multimedia. 2025, pp. 8778–8786

  78. [81]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh et al. “Visualwebarena: Evaluating multimodal agents on realistic visual web tasks”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, pp. 881–905

  79. [82]

    GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    Quanfeng Lu et al. “GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025, pp. 22404–22414

  80. [83]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie et al. “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments”. In: Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Datasets & Benchmarks Track. 2024, pp. 52040–52094. DOI: 10.52202/079017-1650
