MaskClaw: Edge-Side Personalized Privacy Arbitration for GUI Agents with Behavior-Driven Skill Evolution

Dongying Zheng; Kaibo Huang; Linna Zhou; Yanqiu Zhao; Yukun Wei; Zhongliang Yang

arxiv: 2605.28646 · v2 · pith:O3M6ODULnew · submitted 2026-05-27 · 💻 cs.CR · cs.CL

MaskClaw: Edge-Side Personalized Privacy Arbitration for GUI Agents with Behavior-Driven Skill Evolution

Yanqiu Zhao , Dongying Zheng , Kaibo Huang , Yukun Wei , Zhongliang Yang , Linna Zhou This is my paper

Pith reviewed 2026-06-29 11:34 UTC · model grok-4.3

classification 💻 cs.CR cs.CL

keywords GUI agentsprivacy arbitrationedge computingscreenshot privacyskill evolutionpolicy memorypersonalized decisions

0 comments

The pith

MaskClaw performs Allow/Mask/Ask privacy decisions for GUI agent screenshots locally on the edge using visual evidence and policy memory retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents need screenshots to act across apps, yet those images routinely contain private messages, records, and credentials whose protection depends on task, recipient, role, and state. Static detectors miss the boundaries while cloud reasoning requires uploading the raw screen first. MaskClaw keeps the entire decision inside a trusted edge environment by extracting local visual evidence, retrieving user- and task-specific policy memory, and outputting Allow, Mask, or Ask. The system also converts user corrections, cancellations, and edits into reusable privacy skills that pass through a sandbox gate, shown across five designed evolution scenarios on the P-GUI-Evo benchmark built from real UI patterns and sanitized labels.

Core claim

MaskClaw is an edge-side privacy arbitrator that extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. It turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate in five designed skill-evolution scenarios and is evaluated on the P-GUI-Evo benchmark.

What carries the argument

MaskClaw arbitration pipeline that pairs local visual evidence extraction with policy memory retrieval to produce Allow/Mask/Ask decisions, plus behavior-driven skill evolution stored behind a sandbox gate.

If this is right

Raw screenshots never leave the trusted environment before a privacy decision is reached.
User corrections and edits are converted into stored, sandbox-checked privacy skills for later reuse.
Pattern matching, cloud reasoning, and routing alone produce more over-confirmations, over-masking, or raw exposures under the same protocol.
Decisions incorporate task, recipient, application state, and user role through retrieved policy memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be applied to non-GUI agent interfaces that also consume visual input.
Evolved skills might be shared or versioned across users or organizations while remaining local.

Load-bearing premise

Local visual evidence plus retrieved policy memory is sufficient for accurate Allow/Mask/Ask decisions across applications and roles without cloud reasoning, and user corrections reliably produce reusable skills that generalize.

What would settle it

A new UI pattern or role outside the five scenarios and P-GUI-Evo benchmark where MaskClaw outputs an incorrect Allow, Mask, or Ask decision, or where a correction fails to produce a skill that applies to a later similar case.

Figures

Figures reproduced from arXiv: 2605.28646 by Dongying Zheng, Kaibo Huang, Linna Zhou, Yanqiu Zhao, Yukun Wei, Zhongliang Yang.

**Figure 2.** Figure 2: MaskClaw architecture. The edge-side layer keeps raw screenshots local, extracts visual evidence, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Skill-use checks in five controlled scenarios. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Post-optimization filtering from strict text [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative SafeScreenshot examples from the strict L3 audit. The first row shows a synthetic bank-card [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Representative reconstructed benchmark examples. Ask and Allow keep the original UI on both sides; [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

read the original abstract

GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace-specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud-side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge-side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. In five designed skill-evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots under the same protocol. The artifact is available at https://github.com/Theodora-Y/MaskClaw.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MaskClaw gives a workable edge-only flow for GUI agent privacy decisions plus a new benchmark, but the five-scenario tests leave the accuracy and generalization claims unproven.

read the letter

MaskClaw keeps privacy decisions on the device by pulling local visual evidence and user-task policy memory to output Allow, Mask, or Ask before any raw screenshot leaves the edge. It also converts user corrections into reusable skills inside a sandbox gate and ships the P-GUI-Evo benchmark built from real UI patterns.

The concrete contribution is the end-to-end on-device loop that avoids the cloud upload step entirely. That matches a real deployment constraint for agents that see messages, records, and internal workflows. The artifact link and the benchmark construction from reconstructed HTML are useful for anyone who wants to test similar ideas.

The soft spot is the evaluation. The paper reports that pattern matching, cloud reasoning, and routing baselines over-confirm or over-mask under the same protocol, yet the results rest on five designed skill-evolution scenarios. Nothing in the description shows that the extracted local evidence captures recipient, role, or application-state details that static detectors miss, or that the evolved skills transfer to unseen UI layouts. If either piece fails, the edge guarantee does not hold and the system falls back to the problems it criticizes.

This is for groups working on on-device agents or privacy controls for automation tools. A reader who needs a starting point for edge arbitration or a benchmark to extend would find the setup worth examining.

It deserves a serious referee because the proposal is self-contained, the artifact exists, and the problem is timely, even if the current experiments need expansion on generalization and decision metrics.

Referee Report

3 major / 0 minor

Summary. The paper introduces MaskClaw, an edge-side privacy arbitrator for GUI agents. It extracts local visual evidence from screenshots, retrieves user- and task-specific policy memory, and outputs Allow/Mask/Ask decisions before raw screenshots leave a trusted environment. The system evolves reusable privacy skills from user corrections, cancellations, and edits via a sandbox gate. Evaluation occurs on five designed skill-evolution scenarios using the new P-GUI-Evo benchmark (real UI patterns, reconstructed HTML, sanitized labels). Experiments indicate that pattern matching, cloud reasoning, and routing baselines over-confirm, over-mask, or expose raw data under the same protocol.

Significance. If the local extraction and policy retrieval produce accurate decisions without cloud reasoning and if corrections yield generalizable skills, the work would address a practical gap in privacy for GUI agents by avoiding raw screenshot uploads. The P-GUI-Evo benchmark and the artifact release are constructive contributions for future evaluation in this area.

major comments (3)

[Abstract] Abstract: The central claim that local visual evidence plus retrieved policy memory suffices for correct Allow/Mask/Ask decisions across applications and roles rests on unshown mechanisms; no description is given of how context-dependent factors (recipient, application state, user role) are captured locally or how policy completeness is ensured to avoid both over-masking and under-protection.
[Abstract] Abstract: The claim that corrections produce reusable skills that generalize is load-bearing for the 'behavior-driven skill evolution' contribution, yet the evaluation is limited to five designed scenarios on P-GUI-Evo with no reported metrics on transfer to unseen UI patterns, sandbox-gate enforcement details, or failure cases.
[Abstract] Abstract: No quantitative results (accuracy, false-positive rates, comparison tables, or protocol details) are provided, so it is impossible to verify that the system outperforms the criticized baselines or that the edge-side guarantee holds.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity on mechanisms, evaluation details, and quantitative results.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that local visual evidence plus retrieved policy memory suffices for correct Allow/Mask/Ask decisions across applications and roles rests on unshown mechanisms; no description is given of how context-dependent factors (recipient, application state, user role) are captured locally or how policy completeness is ensured to avoid both over-masking and under-protection.

Authors: The full manuscript (Section 3) describes local extraction via UI element detection and OCR on screenshots to capture application state, while policy memory stores user-specific entries that encode roles, recipients, and task contexts. Completeness is maintained via the evolution loop from corrections. We agree the abstract is too concise and will expand it with a one-sentence summary of these mechanisms. revision: yes
Referee: [Abstract] Abstract: The claim that corrections produce reusable skills that generalize is load-bearing for the 'behavior-driven skill evolution' contribution, yet the evaluation is limited to five designed scenarios on P-GUI-Evo with no reported metrics on transfer to unseen UI patterns, sandbox-gate enforcement details, or failure cases.

Authors: The five scenarios test evolution under controlled conditions; the sandbox gate is described as a verification step prior to skill storage. We will add explicit details on gate enforcement and observed failure modes. Transfer metrics to unseen patterns are not currently reported and would require new experiments; we will either add preliminary results or note this as a limitation. revision: partial
Referee: [Abstract] Abstract: No quantitative results (accuracy, false-positive rates, comparison tables, or protocol details) are provided, so it is impossible to verify that the system outperforms the criticized baselines or that the edge-side guarantee holds.

Authors: The current text reports qualitative outcomes for the baselines under the shared protocol. We acknowledge the absence of numeric metrics and will insert accuracy, false-positive rates, comparison tables, and protocol details into the experiments section of the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; system description and benchmark evaluation are self-contained.

full rationale

The paper describes an engineering system (MaskClaw) that performs local visual extraction, policy retrieval, and Allow/Mask/Ask decisions, then evaluates it on five designed skill-evolution scenarios using the introduced P-GUI-Evo benchmark. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear. Claims rest on empirical behavior in the provided scenarios and artifact rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. This is a standard non-circular system paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.1-grok · 5746 in / 1174 out tokens · 36683 ms · 2026-06-29T11:34:57.048356+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents
cs.CR 2026-06 unverdicted novelty 6.0

CAPED reduces incidental visual privacy leakage in mobile GUI agents from 0.766 to 0.268 on seeded AndroidWorld tasks by selectively exposing only task-relevant screen content.

Reference graph

Works this paper leans on

8 extracted references · 7 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Assessing Privacy Preservation and Utility in Online Vision-Language Models

Assessing Privacy Preservation and Util- ity in Online Vision-Language Models.Preprint, arXiv:2604.09695. Yining Chen, Jihao Zhao, Bo Tang, Haofen Wang, Yue Zhang, Fei Huang, Feiyu Xiong, and Zhiyu Li. 2026. MemPrivacy: Privacy-Preserving Personalized Mem- ory Management for Edge-Cloud Agents.Preprint, arXiv:2605.09530. Qingyan Guo, Rui Wang, Junliang Guo...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

InProceedings of the 17th Conference of the European Chapter of the Association for Computa- tional Linguistics: Tutorial Abstracts, pages 27–30, Dubrovnik, Croatia

Privacy-Preserving Natural Language Process- ing. InProceedings of the 17th Conference of the European Chapter of the Association for Computa- tional Linguistics: Tutorial Abstracts, pages 27–30, Dubrovnik, Croatia. Association for Computational Linguistics. Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yu...

work page arXiv 2024
[3]

Autoglm: Autonomous foundation agents for guis

Curran Associates, Inc. Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, and Mike Zheng Shou. 2025. Computer-Use Agents as Judges for Generative User Interface.arXiv preprint. Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Jun- jun Sh...

work page arXiv 2025
[4]

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, and Yuxiao Dong

MemGPT: Towards LLMs as Operating Sys- tems.arXiv preprint. Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, and Yuxiao Dong. 2025. We- bRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning.Inter- national Conference on Learning Representations, 2...

2025
[5]

Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, and Sara- van Rajmohan

Mobile-Agent: Autonomous Multi-Modal Mo- bile Device Agent with Visual Perception.arXiv preprint. Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, and Sara- van Rajmohan. 2025a. Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM- Powered Agents.Preprint, arXiv:2509.17488. Shuai Wang, Weiw...

work page arXiv 2026
[6]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Large Language Models as Optimizers.In- ternational Conference on Learning Representations, 2024:12028–12068. Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chun- yuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual Ground- ing in GPT-4V. https://arxiv.org/abs/2310.11441v2. Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Ferret-UI: Grounded mobile UI understanding with multimodal LLMs, 2024

MiniCPM-V: A GPT-4V Level MLLM on Your Phone.arXiv preprint. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs.Preprint, arXiv:2404.05719. Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guest...

work page arXiv 2024
[8]

InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, pages 1–20, New York, NY , USA

AppAgent: Multimodal Agents as Smartphone Users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, pages 1–20, New York, NY , USA. Association for Computing Machinery. Jie Zhang, Xiangkui Cao, Zhouyu Han, Shiguang Shan, and Xilin Chen. 2026. Multi-PA: A Multi- perspective Benchmark on Privacy Assessment for Large Vis...

work page arXiv 2025

[1] [1]

Assessing Privacy Preservation and Utility in Online Vision-Language Models

Assessing Privacy Preservation and Util- ity in Online Vision-Language Models.Preprint, arXiv:2604.09695. Yining Chen, Jihao Zhao, Bo Tang, Haofen Wang, Yue Zhang, Fei Huang, Feiyu Xiong, and Zhiyu Li. 2026. MemPrivacy: Privacy-Preserving Personalized Mem- ory Management for Edge-Cloud Agents.Preprint, arXiv:2605.09530. Qingyan Guo, Rui Wang, Junliang Guo...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

InProceedings of the 17th Conference of the European Chapter of the Association for Computa- tional Linguistics: Tutorial Abstracts, pages 27–30, Dubrovnik, Croatia

Privacy-Preserving Natural Language Process- ing. InProceedings of the 17th Conference of the European Chapter of the Association for Computa- tional Linguistics: Tutorial Abstracts, pages 27–30, Dubrovnik, Croatia. Association for Computational Linguistics. Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yu...

work page arXiv 2024

[3] [3]

Autoglm: Autonomous foundation agents for guis

Curran Associates, Inc. Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, and Mike Zheng Shou. 2025. Computer-Use Agents as Judges for Generative User Interface.arXiv preprint. Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Jun- jun Sh...

work page arXiv 2025

[4] [4]

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, and Yuxiao Dong

MemGPT: Towards LLMs as Operating Sys- tems.arXiv preprint. Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, and Yuxiao Dong. 2025. We- bRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning.Inter- national Conference on Learning Representations, 2...

2025

[5] [5]

Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, and Sara- van Rajmohan

Mobile-Agent: Autonomous Multi-Modal Mo- bile Device Agent with Visual Perception.arXiv preprint. Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, and Sara- van Rajmohan. 2025a. Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM- Powered Agents.Preprint, arXiv:2509.17488. Shuai Wang, Weiw...

work page arXiv 2026

[6] [6]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Large Language Models as Optimizers.In- ternational Conference on Learning Representations, 2024:12028–12068. Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chun- yuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual Ground- ing in GPT-4V. https://arxiv.org/abs/2310.11441v2. Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Ferret-UI: Grounded mobile UI understanding with multimodal LLMs, 2024

MiniCPM-V: A GPT-4V Level MLLM on Your Phone.arXiv preprint. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs.Preprint, arXiv:2404.05719. Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guest...

work page arXiv 2024

[8] [8]

InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, pages 1–20, New York, NY , USA

AppAgent: Multimodal Agents as Smartphone Users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, pages 1–20, New York, NY , USA. Association for Computing Machinery. Jie Zhang, Xiangkui Cao, Zhouyu Han, Shiguang Shan, and Xilin Chen. 2026. Multi-PA: A Multi- perspective Benchmark on Privacy Assessment for Large Vis...

work page arXiv 2025