VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

Chen Zhao; Yitian Qian; Youting Wang; Yuan Tang

arXiv: 2606.07595 · v1 · pith:CKR5LNSQnew · submitted 2026-05-29 · cs.CV · cs.AI · cs.IR

Pith reviewed 2026-06-28 22:57 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.IR

keywords vision language agents · action boundary propagation · VisualLeakBench · PII leakage · unsafe text · tool arguments · benchmark evaluation · defensive prompts

The pith

Vision-language agents copy sensitive text from images into tool arguments at rates of 79 to 86 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VisualLeakBench to study how vision-language agents propagate visible sensitive or unsafe text from images into downstream tool calls. It evaluates four production VLMs on a 100-image subset across note capture and external handoff workflows. Baseline results show high propagation for both PII and unsafe text, with defensive prompts reducing PII propagation mainly by avoiding tool use. The work measures visual-to-tool leakage and provides an oracle diagnostic to localize failures at the tool boundary.

Core claim

VisualLeakBench shows that target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases at baseline. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII propagation falls to 2.0%, largely by suppressing tool use. Propagation rates depend on the tool surface, and most failures localize at the tool boundary rather than in responses.
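
To make these rates concrete, here is a minimal sketch of how a tool-propagation rate could be computed from agent traces, and how it factors into a tool-use rate and a propagation rate conditional on a tool call. The `Trace` format, the case-insensitive substring matching rule, and the toy traces are all assumptions for illustration; they are not the paper's harness or data.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One agent run on one image (hypothetical format, not the paper's)."""
    target: str                                     # labeled string visible in the image
    tool_calls: list = field(default_factory=list)  # each item: dict of tool arguments


def propagated(trace: Trace) -> bool:
    """True if the target string appears in any tool argument value."""
    needle = trace.target.lower()
    return any(
        needle in str(value).lower()
        for call in trace.tool_calls
        for value in call.values()
    )


def summarize(traces: list) -> dict:
    """Overall rate, tool-use rate, and propagation rate given a tool call."""
    n = len(traces)
    with_tool = [t for t in traces if t.tool_calls]
    leaked = sum(propagated(t) for t in with_tool)
    return {
        "propagation_rate": leaked / n,
        "tool_use_rate": len(with_tool) / n,
        "propagation_given_tool": leaked / len(with_tool) if with_tool else 0.0,
    }


# Toy traces, invented for illustration only.
traces = [
    Trace("555-0100", [{"note": "Call back at 555-0100"}]),
    Trace("555-0101", [{"note": "Customer asked for a callback"}]),
    Trace("555-0102", []),  # agent declined to call any tool
    Trace("555-0103", []),
]
print(summarize(traces))
```

Under this framing, a mitigation that lowers `propagation_rate` mainly by lowering `tool_use_rate`, while `propagation_given_tool` barely moves, would correspond to the "suppressing tool use" pattern described for PII.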

What carries the argument

The VisualLeakBench benchmark, a diversified 500-image set with a 100-image stratified subset evaluated on note capture and external handoff workflows, which quantifies visual-to-tool propagation of target strings.
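
As a rough illustration of what drawing a stratified 100-image subset from a 500-image set can look like, the sketch below allocates per-stratum quotas in proportion to stratum size and samples within each stratum. Stratifying by scene type, the even spread of images across scene types, and the record format are assumptions; the page does not state the paper's actual stratification criteria.

```python
import random
from collections import defaultdict


def stratified_subset(items, key, k, seed=0):
    """Pick k items with per-stratum quotas proportional to stratum size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    total = len(items)
    # Largest-remainder allocation so the quotas sum exactly to k.
    exact = {s: k * len(members) / total for s, members in strata.items()}
    quota = {s: int(x) for s, x in exact.items()}
    leftover = k - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    subset = []
    for s in sorted(strata):
        subset.extend(rng.sample(strata[s], quota[s]))
    return subset


# Hypothetical catalogue: 500 records spread evenly over five scene types.
scenes = ["ui", "chat", "document", "form", "dashboard"]
images = [{"id": i, "scene": scenes[i % 5]} for i in range(500)]
subset = stratified_subset(images, key=lambda im: im["scene"], k=100)
print(len(subset))
```

Fixing the seed keeps the draw reproducible, which is the property a reader would need in order to re-run an evaluation on the same subset.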

If this is right

Tool-surface dependence means search-like tools can suppress PII but not unsafe text propagation.
Defensive prompts reduce PII tool propagation to 2% but leave unsafe-text at over 50%.
The labeled-target oracle upper-bound diagnostic shows most failures occur at the tool boundary.
Response-side leakage remains as a residual risk even when tool propagation is controlled.
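
One way to picture a labeled-target oracle diagnostic is a guard that is handed the known target string and redacts it from tool arguments before dispatch. The sketch below is an assumption about the general shape of such a check, not the paper's implementation; the function name, the redaction token, and the example arguments are hypothetical, and the example value is synthetic.

```python
import re

REDACTED = "[REDACTED]"


def oracle_guard(tool_args: dict, target: str) -> tuple:
    """Redact a known target string from string-valued tool arguments.

    Returns (cleaned_args, fired), where fired says whether anything was
    redacted. It is an oracle because it is given the labeled target,
    which a deployed guard would not have.
    """
    pattern = re.compile(re.escape(target), re.IGNORECASE)
    cleaned, fired = {}, False
    for name, value in tool_args.items():
        if isinstance(value, str):
            new_value, hits = pattern.subn(REDACTED, value)
            fired = fired or hits > 0
            cleaned[name] = new_value
        else:
            cleaned[name] = value  # non-string arguments pass through untouched
    return cleaned, fired


args = {"recipient": "support", "body": "SSN on file: 000-12-3456", "priority": 2}
cleaned, fired = oracle_guard(args, "000-12-3456")
print(cleaned, fired)
```

Because the guard only inspects tool arguments, the same string in the agent's visible response would pass through unchanged, which matches the residual response-side risk noted above.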

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents may need explicit boundary-aware training or architectures to prevent copying visible text without understanding context.
The benchmark could be extended to test mitigation strategies like output filtering or fine-tuning on boundary examples.
Similar propagation issues might appear in other multimodal systems handling documents or UIs.
Real-world deployments involving screenshots could face data leakage risks not captured by current safety evaluations.

Load-bearing premise

The 500-image benchmark and 100-image subset with the two workflows represent the action-boundary propagation failures typical in real production vision-language agent systems.

What would settle it

Finding propagation rates under 20% for both PII and unsafe text on a diverse set of new images from production-like environments would indicate the benchmark overestimates the failure mode.
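
A replication that aims to show rates "under 20%" would need to account for sampling noise, not just the point estimate. The sketch below uses a Wilson score interval and accepts the claim only when the whole interval sits below the threshold. The choice of interval, the 95% level, and the counts in the example are assumptions for illustration, not part of the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def clearly_below(successes: int, n: int, threshold: float = 0.20) -> bool:
    """True only if the upper end of the interval is under the threshold."""
    _, upper = wilson_interval(successes, n)
    return upper < threshold


# Invented counts: propagations observed in 100 new traces.
for leaks in (10, 15):
    low, high = wilson_interval(leaks, 100)
    print(leaks, round(low, 3), round(high, 3), clearly_below(leaks, 100))
```

With 100 traces, an observed 15% is below the threshold as a point estimate but its interval still crosses 20%, so a small replication could be inconclusive in either direction.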

Figures

Figures reproduced from arXiv: 2606.07595 by Chen Zhao, Yitian Qian, Youting Wang, Yuan Tang.

**Figure 1.** Trace diagnostic. A visual target may be absent from the visible response yet present in tool arguments, so classification and guard diagnostics inspect the action boundary.

**Figure 2.** Qualitative action-boundary propagation trace. The target value is synthetic; both panels redact the exact value while preserving the tool-boundary failure.
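
The point of Figure 1, that a target can be missing from the visible response yet present in tool arguments, suggests classifying each trace by where the target shows up. The sketch below does that with mutually exclusive classes. The class names, their boundaries, and the substring matching rule are assumptions for illustration and may not match the paper's exact taxonomy.

```python
from enum import Enum


class TraceClass(Enum):
    NO_TOOL_NO_LEAK = "no tool call, no response leak"
    SAFE_TOOL_CALL = "tool call without target, no response leak"
    RESPONSE_ONLY = "target in response only"
    TOOL_ONLY = "target in tool arguments only"
    RESPONSE_AND_TOOL = "target in response and tool arguments"


def contains(text: str, target: str) -> bool:
    """Case-insensitive substring match (a simplifying assumption)."""
    return target.lower() in text.lower()


def classify(target: str, response: str, tool_calls: list) -> TraceClass:
    """Assign a trace to exactly one class based on where the target appears."""
    in_response = contains(response, target)
    in_tool = any(
        contains(str(value), target)
        for call in tool_calls
        for value in call.values()
    )
    if in_response and in_tool:
        return TraceClass.RESPONSE_AND_TOOL
    if in_tool:
        return TraceClass.TOOL_ONLY
    if in_response:
        return TraceClass.RESPONSE_ONLY
    return TraceClass.SAFE_TOOL_CALL if tool_calls else TraceClass.NO_TOOL_NO_LEAK


# Synthetic example: the response looks clean, but the tool argument carries the target.
label = classify("tok-ABC123", "Saved your note.", [{"text": "API key tok-ABC123"}])
print(label.name)
```

A response-only check would score the example as clean, which is why inspecting the tool arguments matters for this failure mode.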

Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary propagation, where sensitive or unsafe visible text is copied from an image into downstream tool arguments. We present VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, and evaluate a stratified 100-image agent subset with four production VLM systems under two workflows: note capture and external handoff. At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII tool propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Rates are tool-surface dependent: search-like tools suppress PII propagation, but rendered unsafe text still crosses tool boundaries. We measure visual-to-tool propagation rather than downstream instruction execution. We additionally provide a labeled-target oracle upper-bound diagnostic that localizes most failures at the tool boundary while leaving response-side leakage as residual risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VisualLeakBench gives concrete leakage rates for VL agents but the benchmark's match to real deployments is unvalidated.

The paper's main takeaway is that vision-language agents copy sensitive text from images into tool arguments at high rates: 78.8% for PII and 85.5% for unsafe text at baseline across four systems. A defensive prompt drops PII leakage to 2% but leaves unsafe-text leakage at 52.6%, and the effect varies by tool type. The authors also show, via an oracle check, that most failures happen at the tool boundary.

What is new is VisualLeakBench itself: a 500-image set covering UI, chat, document, form, and dashboard scenes, plus a 100-image stratified subset tested on two workflows (note capture and external handoff). The measurements are direct empirical counts rather than fitted models, and the tool-surface dependence is a practical observation that prior work had not quantified this way.

The paper does a clean job of isolating the propagation issue and reporting numbers that can be checked. The diversified scenes and multiple systems give it some breadth.

The soft spot is representativeness. The rates only speak beyond this benchmark if the chosen images and workflows reflect actual production agent use, but there is no external distribution check or expert validation against real deployments. Without details on how the images were constructed and labeled, selection effects cannot be ruled out. If the full methods section addresses these points with transparent protocols, the concern shrinks; otherwise the numbers stay benchmark-specific.

This is for people working on agent safety and VLM tool-use robustness. A reader who needs a starting point for measuring visual-to-tool leakage would get value from the benchmark and the reported rates. It deserves peer review so referees can examine the image construction and labeling process and push on generalizability.

Referee Report

2 major / 1 minor

Summary. The paper introduces VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, to study action-boundary propagation failures in vision-language agents where visible sensitive or unsafe text is copied into downstream tool arguments. It evaluates a stratified 100-image subset with four production VLM systems under note-capture and external-handoff workflows, reporting baseline propagation rates of 78.8% for PII cases and 85.5% for rendered unsafe-text cases; under a defensive system prompt these fall to 2.0% and 52.6% respectively, with additional analysis of tool-surface dependence and a labeled-target oracle diagnostic that localizes most failures at the tool boundary.

Significance. If the benchmark is representative, the concrete, reproducible rates would establish a clear and persistent vulnerability in current VL agents, showing that defensive prompts are only partially effective against unsafe-text leakage while PII leakage is mitigated mainly by suppressing tool use. The strengths include the provision of a public benchmark, the oracle upper-bound diagnostic, and the demonstration of tool-surface dependence, all of which support falsifiable follow-up work.

major comments (2)

[Benchmark construction and evaluation setup] The central claim that the reported rates (78.8% PII, 85.5% unsafe-text at baseline) indicate action-boundary failures in vision-language agents depends on the representativeness of the 500-image set and 100-image subset; the manuscript supplies no external validation, distribution comparison to production deployments, or expert assessment confirming that the chosen scenes and note-capture/external-handoff workflows match real VL agent usage patterns.
[Methods] The abstract and methods provide concrete percentages from the four systems but omit details on image construction, target-string labeling protocol, and statistical controls for the stratified subset, preventing assessment of selection bias or post-hoc choices that could affect the measured propagation rates.

minor comments (1)

[Abstract and evaluation] The distinction between visual-to-tool propagation and downstream instruction execution could be stated more explicitly when introducing the oracle diagnostic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying key areas where additional transparency and discussion are warranted. We address each major comment below and describe the planned revisions.

Referee: [Benchmark construction and evaluation setup] The central claim that the reported rates (78.8% PII, 85.5% unsafe-text at baseline) indicate action-boundary failures in vision-language agents depends on the representativeness of the 500-image set and 100-image subset; the manuscript supplies no external validation, distribution comparison to production deployments, or expert assessment confirming that the chosen scenes and note-capture/external-handoff workflows match real VL agent usage patterns.

Authors: We agree that the absence of external validation or production-distribution matching limits the strength of any claim about exact real-world prevalence. The benchmark was constructed as a controlled, diversified collection across five scene categories chosen to reflect common VL-agent inputs, with workflows drawn from standard agent designs. We will add an expanded Limitations section that explicitly states the synthetic nature of the data, the lack of external validation or expert review against production logs, and the consequent need to interpret the reported rates as evidence that the failure mode can occur at high frequency rather than as precise estimates of deployment risk. The public benchmark release is intended to support exactly the follow-up validation studies the referee correctly identifies as missing. revision: partial

Referee: [Methods] The abstract and methods provide concrete percentages from the four systems but omit details on image construction, target-string labeling protocol, and statistical controls for the stratified subset, preventing assessment of selection bias or post-hoc choices that could affect the measured propagation rates.

Authors: The referee is correct that these procedural details are currently insufficient. In the revised manuscript we will insert a dedicated subsection in Methods that describes: (i) the image-construction pipeline for each of the five scene categories, (ii) the exact target-string identification and labeling protocol (including verification steps), and (iii) the stratification criteria, randomization procedure, and any balancing statistics used to form the 100-image evaluation subset. These additions will enable independent assessment of selection bias. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct measurements

The paper presents VisualLeakBench as an empirical evaluation of propagation rates in VL agents across defined image sets and workflows. All reported figures (78.8% PII and 85.5% unsafe-text at baseline; 2.0% PII and 52.6% unsafe-text under the defensive prompt) are direct counts from the described experiments on the 100-image subset of the 500-image benchmark. No equations, fitted parameters, self-citations for uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on observable tool-argument outputs rather than any derivation that reduces to its own inputs by construction. The representativeness concern raised by the referee is a question of external validity, not circularity in the reported measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the benchmark images and evaluation protocols; no free parameters, axioms beyond standard empirical methods, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5755 in / 1246 out tokens · 30133 ms · 2026-06-28T22:57:41.572366+00:00 · methodology
