One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Ravikumar Balakrishnan; Sanket Mendapara

arxiv: 2604.25102 · v1 · submitted 2026-04-28 · 💻 cs.CV

Pith reviewed 2026-05-07 17:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords typographic prompt injection · VLM safety · embedding distance · attack success rate · adversarial perturbations · red teaming · multimodal models · safety alignment

The pith

Multimodal embedding distance strongly predicts success of typographic attacks on vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests typographic prompt injections in VLMs and finds that the distance between the perturbed image and clean text in multimodal embedding space correlates strongly with attack success rate. This distance acts as a model-agnostic signal because it tracks how much a visual change impairs text readability or weakens safety refusals. Researchers then optimize small bounded perturbations on surrogate embeddings to increase similarity, which raises success rates on target models like GPT-4o and Claude by improving readability and lowering refusal rates at the same time. The balance between these two effects shifts depending on how strong a model's safety filter is and how much the image is degraded.

Core claim

Across four VLMs, twelve font sizes, and ten transformations, multimodal embedding distance predicts attack success rate with correlations from -0.71 to -0.93. Optimization of bounded l_infinity perturbations on surrogate embeddings recovers both perceptual readability and reduced safety-aligned refusals as co-occurring effects, with the dominant mechanism depending on model safety strength and visual degradation level.
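To make the correlation claim concrete, here is a minimal sketch of how a Pearson coefficient between embedding distance and ASR could be computed. The two lists are made-up illustrative numbers, not values from the paper, and `pearson_r` is a hypothetical helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements, one entry per rendering condition
# (for example a font size or a transformation). Invented for illustration.
embedding_distance = [0.20, 0.30, 0.45, 0.60, 0.80]
attack_success_rate = [0.90, 0.70, 0.65, 0.30, 0.10]

r = pearson_r(embedding_distance, attack_success_rate)
print(f"Pearson r = {r:.2f}")
```

A negative coefficient here means that conditions with a larger image-to-text distance tend to have a lower attack success rate, which is the direction the paper reports.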

What carries the argument

Multimodal embedding distance as a predictive proxy for attack success rate, used to guide CWA-SSA optimization of bounded perturbations on surrogate models.
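The idea of raising image-text similarity under a bounded perturbation can be sketched with a toy example. The code below is not CWA-SSA, whose details the abstract does not give; it is a plain ensemble sign-gradient ascent with projection onto an ℓ∞ ball, used as a simplified stand-in. The linear "encoders", the random target embeddings, and the values of `eps` and `step` are all assumptions chosen only so the example runs.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def cosine_grad_wrt_input(W, x, t):
    """Gradient of cos(W x, t) with respect to x for a linear encoder W."""
    u = matvec(W, x)
    nu = math.sqrt(sum(a * a for a in u))
    nt = math.sqrt(sum(b * b for b in t))
    c = cosine(u, t)
    # d cos / d u = t / (|u||t|) - cos * u / |u|^2
    g_u = [tj / (nu * nt) - c * uj / (nu * nu) for uj, tj in zip(u, t)]
    # Chain rule through u = W x gives W^T g_u.
    return [sum(W[j][i] * g_u[j] for j in range(len(W))) for i in range(len(x))]

def mean_similarity(encoders, targets, x):
    sims = [cosine(matvec(W, x), t) for W, t in zip(encoders, targets)]
    return sum(sims) / len(sims)

def linf_sign_ascent(x0, encoders, targets, eps=8 / 255, step=1 / 255, iters=50):
    """Raise mean cosine similarity over surrogates with ||delta||_inf <= eps."""
    delta = [0.0] * len(x0)
    best_delta, best_sim = list(delta), mean_similarity(encoders, targets, x0)
    for _ in range(iters):
        x = [a + d for a, d in zip(x0, delta)]
        grad = [0.0] * len(x0)
        for W, t in zip(encoders, targets):
            g = cosine_grad_wrt_input(W, x, t)
            grad = [a + b / len(encoders) for a, b in zip(grad, g)]
        for i, g in enumerate(grad):
            d = delta[i] + step * ((g > 0) - (g < 0))   # signed ascent step
            d = max(-eps, min(eps, d))                   # project onto the l_inf ball
            d = max(0.0, min(1.0, x0[i] + d)) - x0[i]    # keep pixels in [0, 1]
            delta[i] = d
        sim = mean_similarity(encoders, targets, [a + d for a, d in zip(x0, delta)])
        if sim > best_sim:
            best_sim, best_delta = sim, list(delta)
    return best_delta, best_sim

# Toy setup: a 16-"pixel" image and four random linear surrogates (hypothetical).
rng = random.Random(0)
n_pixels, dim, n_surrogates = 16, 4, 4
x0 = [rng.random() for _ in range(n_pixels)]
encoders = [[[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(dim)]
            for _ in range(n_surrogates)]
targets = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_surrogates)]

before = mean_similarity(encoders, targets, x0)
delta, after = linf_sign_ascent(x0, encoders, targets)
print(f"mean cosine similarity: {before:.3f} -> {after:.3f}")
print(f"max |delta| = {max(abs(d) for d in delta):.4f}")
```

Averaging the gradient over several surrogates is what lets the perturbation be built without querying the target model; whether the gain transfers to a real VLM is the empirical question the paper addresses.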

If this is right

Embedding distance serves as an interpretable proxy that works across models without querying targets directly.
A single bounded perturbation can simultaneously restore readability and reduce safety refusals.
The main bypass route shifts with model safety filter strength and degree of visual degradation.
Surrogate optimization transfers to production VLMs without needing their training data or internals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety monitoring systems could track embedding distances in real time to flag potential typographic injections.
The same proxy might extend to non-typographic multimodal attacks if embedding space alignment holds.
Stronger safety training may require larger perturbations before readability becomes the dominant failure mode.
Testing the correlation on future VLMs would show whether the proxy remains stable as architectures evolve.

Load-bearing premise

The correlation between embedding distance and attack success is driven by changes in text readability and safety alignment, and optimizations on surrogates transfer to target models without internal access.

What would settle it

A set of typographic images where embedding distance fails to predict attack success rate, or where surrogate-optimized perturbations do not increase success on closed models like GPT-4o.

Figures

Figures reproduced from arXiv: 2604.25102 by Ravikumar Balakrishnan, Sanket Mendapara.

**Figure 1.** Overview of our embedding-guided adversarial optimization. A degraded typographic image (left) is optimized via the popular …

read the original abstract

Typographic prompt injection exploits vision-language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain *why* certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r = -0.71$ to $-0.93$, $p < 0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Embedding distance predicts typographic ASR well in the reported sweeps, but the surrogate optimization step lacks confirmation that it actually shrinks distance inside the black-box target models.

The paper shows that multimodal embedding distance between the perturbed typographic image and the clean text strongly tracks attack success rate on VLMs, with correlations from -0.71 to -0.93 across four models, twelve font sizes, and ten transformations. The authors then use that relationship to drive a surrogate optimization (CWA-SSA on four embedding models) that generates bounded perturbations meant to improve readability or reduce safety refusals without touching the target VLM internals. The downstream results on GPT-4o, Claude, Mistral, and Qwen show higher ASR, with the dominant mechanism shifting based on how strong the model's safety filter is and how much the image is degraded to begin with.

Referee Report

2 major / 2 minor

Summary. The paper claims that multimodal embedding distance between typographically perturbed images and the clean text strongly predicts attack success rate (ASR) for prompt injection in VLMs (Pearson r = -0.71 to -0.93, p < 0.01) across four models, twelve font sizes, and ten transformations, providing a model-agnostic proxy. It then uses this to motivate surrogate optimization (CWA-SSA) on four embedding models to generate bounded ℓ∞ perturbations that maximize similarity. Experiments on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL show that this improves ASR via two co-occurring effects: recovered perceptual readability and reduced safety refusals, with the dominant mechanism depending on model safety strength and degradation level.

Significance. If the central proxy and transfer hold, the work is significant for providing an interpretable, access-free method to probe and stress-test VLM safety against typographic attacks. The dual-mechanism finding (readability vs. alignment) offers mechanistic insight beyond raw ASR maximization, and the surrogate approach is practically useful for black-box commercial models. The broad empirical sweep across models and settings strengthens the case for embedding distance as a diagnostic tool.

major comments (2)

[Empirical study and correlation analysis] The reported correlations and p-values (r = -0.71 to -0.93, p < 0.01) are presented without any mention of correction for multiple comparisons across the twelve font sizes, ten transformations, four models, and five degradation settings. This directly affects the reliability of the claimed strong predictive power of the embedding-distance proxy.
[Surrogate optimization and transfer experiments] The surrogate optimization (CWA-SSA) is justified by the embedding-distance proxy, yet the manuscript provides no verification that the resulting perturbations actually reduce embedding distance in the target VLMs' spaces (GPT-4o, Claude, etc.). ASR gains and qualitative readability/safety observations are reported, but without cross-model distance measurements or ablations isolating the embedding mediator, the improvements could arise from generic image enhancement or target-specific artifacts unrelated to the stated mechanism.
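The first major comment asks for a multiple-comparison correction. As a minimal sketch of what a Bonferroni adjustment involves, the code below multiplies each raw p-value by the number of tests and caps the result at 1. The p-values and condition labels are invented for illustration, and the threshold of 0.01 is taken from the significance level quoted in the abstract.

```python
def bonferroni(p_values, alpha=0.01):
    """Return Bonferroni-adjusted p-values and whether each stays significant."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, significant

# Hypothetical raw p-values, one per tested correlation (not from the paper).
raw_p = {
    "condition_1": 0.0004,
    "condition_2": 0.0020,
    "condition_3": 0.0080,
    "condition_4": 0.0300,
}

adjusted, significant = bonferroni(list(raw_p.values()))
for name, p, p_adj, keep in zip(raw_p, raw_p.values(), adjusted, significant):
    print(f"{name}: raw p = {p:.4f}, adjusted p = {p_adj:.4f}, significant = {keep}")
```

Bonferroni is conservative when the tests are correlated, as they likely are when the same images are evaluated across models, so a reported result that survives it is fairly robust.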

minor comments (2)

[Abstract] The abstract refers to 'five degradation settings' and 'ten transformations' without listing or defining them; these should be specified early (e.g., in a table or methods paragraph) so readers can evaluate the experimental scope.
[Methodology] Provide additional implementation details on the CWA-SSA algorithm, the exact choice and training of the four surrogate embedding models, and the precise definition of the bounded ℓ∞ perturbation constraint.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below, noting revisions where appropriate.

Referee: [Empirical study and correlation analysis] The reported correlations and p-values (r = -0.71 to -0.93, p < 0.01) are presented without any mention of correction for multiple comparisons across the twelve font sizes, ten transformations, four models, and five degradation settings. This directly affects the reliability of the claimed strong predictive power of the embedding-distance proxy.

Authors: We agree that multiple-comparison correction is required. In the revision we will apply Bonferroni adjustment across the tested conditions (font sizes, transformations, models, and degradation levels), report both raw and adjusted p-values, and state the total number of comparisons explicitly. Recalculation indicates that the large negative correlations remain significant after correction in the great majority of cases, but the corrected values will be presented for full transparency. revision: yes
Referee: [Surrogate optimization and transfer experiments] The surrogate optimization (CWA-SSA) is justified by the embedding-distance proxy, yet the manuscript provides no verification that the resulting perturbations actually reduce embedding distance in the target VLMs' spaces (GPT-4o, Claude, etc.). ASR gains and qualitative readability/safety observations are reported, but without cross-model distance measurements or ablations isolating the embedding mediator, the improvements could arise from generic image enhancement or target-specific artifacts unrelated to the stated mechanism.

Authors: We acknowledge the limitation: commercial target models are black-box and do not expose embedding APIs, so direct distance measurements on GPT-4o, Claude, etc., are impossible. The surrogate choice is grounded in the cross-model correlation study; transfer of ASR gains is the observable evidence. To isolate the embedding mediator from generic effects we will add ablations comparing CWA-SSA outputs against (i) random ℓ∞-bounded perturbations and (ii) non-embedding image enhancements (e.g., unsharp masking). These controls will be reported in the revision, together with an explicit discussion of the black-box constraint. revision: partial
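The random-perturbation control proposed in the second response can be sketched as follows. This is an illustrative harness under stated assumptions, not the authors' protocol: `score` is a hypothetical callable standing in for whatever is measured (ASR on a target model, or surrogate similarity), and the toy score and "optimized" perturbation exist only so the example runs.

```python
import random

def random_linf_perturbation(x0, eps, rng):
    """Sample each pixel offset uniformly in [-eps, eps], then clip pixels to [0, 1]."""
    delta = []
    for pixel in x0:
        d = rng.uniform(-eps, eps)
        d = max(0.0, min(1.0, pixel + d)) - pixel
        delta.append(d)
    return delta

def gap_over_random_control(score, x0, optimized_delta, eps, n_trials=100, seed=0):
    """Score of the optimized perturbation minus the mean score of random controls."""
    rng = random.Random(seed)

    def apply(delta):
        return [p + d for p, d in zip(x0, delta)]

    optimized_score = score(apply(optimized_delta))
    control_scores = [
        score(apply(random_linf_perturbation(x0, eps, rng)))
        for _ in range(n_trials)
    ]
    return optimized_score - sum(control_scores) / n_trials

def toy_score(x):
    # Hypothetical stand-in for ASR or similarity: mean pixel intensity.
    return sum(x) / len(x)

eps = 8 / 255
x0 = [0.5] * 16
optimized_delta = [eps] * 16   # pretend this came from an optimizer
gap = gap_over_random_control(toy_score, x0, optimized_delta, eps)
print(f"gain over random control = {gap:.4f}")
```

If the optimized perturbation does not beat random noise of the same budget by a clear margin, the ASR gain cannot be attributed to the embedding objective.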

Circularity Check

0 steps flagged

No circularity: empirical correlation and surrogate optimization are independent measurements.

full rationale

The paper reports an observed Pearson correlation (r = -0.71 to -0.93) between multimodal embedding distance and ASR across VLMs, then uses that empirical relationship to motivate surrogate optimization (CWA-SSA) that minimizes distance in four embedding models before evaluating ASR on black-box targets. No equation, definition, or self-citation reduces the reported correlation, the optimization objective, or the downstream ASR gains to quantities defined by fitting on the same data or by prior self-referential claims. The derivation chain consists of separate experimental steps—correlation measurement, surrogate perturbation generation, and target evaluation—none of which collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work assumes standard multimodal embedding spaces and bounded perturbation optimization exist and transfer.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1] Hiroki Azuma and Yusuke Matsui. Defense-prefix for preventing typographic attacks on CLIP. arXiv preprint arXiv:2304.04512, 2023.

[2] Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, and Qing Guo. Scenetap: Scene-coherent typographic adversarial planner against vision-language models in real-world environments. In CVPR,

[3] arXiv:2412.00114.

[4] Huanran Chen, Yichi Zhang, Yinpeng Dong, Xiao Yang, Hang Su, and Jun Zhu. Rethinking model ensemble in transfer-based adversarial attacks. In ICLR, 2024.

[5] Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is Google's Bard to adversarial image attacks? arXiv preprint arXiv:2309.11751, 2023.

[6] Yichen Gong, Delong Deng, Xinwen Bai, Xuan Dai, Bofei Ding, Zongming Zhong, Rongyu Wu, Renhe Ji, and Xipeng Qiu. FigStep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608, 2023.

[7] Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Soni, Nan Han, Nan Xiao, et al. Jina CLIP: Your CLIP model is also your text retriever. arXiv preprint arXiv:2405.20204, 2024.

[8] Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. In Findings of ACL, pages 11718–11740, 2024.

[9] Xinyu Ma et al. SCAM: A real-world typographic robustness evaluation for multimodal foundation models. arXiv preprint arXiv:2504.04893, 2025.

[10] Xiaomeng Wang, Zhengyu Zhao, and Martha Larson. Typographic attacks in a multi-image setting. In Proceedings of NAACL, 2025. arXiv:2502.08193.

[11] Ziqi Zhou et al. Advclip: Downstream-agnostic adversarial examples in multimodal contrastive learning. arXiv preprint arXiv:2308.07026, 2023.
