arxiv: 2604.12371 · v2 · submitted 2026-04-14 · 💻 cs.CV

Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Pith reviewed 2026-05-10 16:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords typographic attacks · vision-language models · prompt injection · embedding alignment · adversarial robustness · attack success rate · multimodal embeddings · font size effects

The pith

Closer text-image embedding alignment predicts higher success for typographic attacks on vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines typographic prompt injection attacks on vision-language models, where adversarial text is rendered into images to bypass safety filters. It evaluates one thousand prompts across four VLMs under controlled variations in font size and visual quality. The central result is a strong negative correlation between the distance separating text and image embeddings in two separate multimodal embedding models and the rate at which the attacks succeed. Smaller distances reliably correspond to higher attack success, and image transformations that widen the distance also reduce success. This connection offers a measurable way to anticipate when a VLM will follow instructions embedded in rendered text.

Core claim

The authors establish that text-image embedding distances computed by JinaCLIP and Qwen3-VL-Embedding correlate negatively with attack success rates across GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct, with correlation coefficients ranging from -0.71 to -0.93. Mid-range font sizes maximize success while very small fonts suppress it, text attacks outperform image attacks for some models, and degradations such as blur or noise widen embedding distances while cutting success rates by 34 to 96 percent.
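To make the reported statistic concrete, here is a minimal sketch of how a correlation between per-condition embedding distance and attack success rate could be computed. The data points are invented placeholders, not values from the paper, and `pearson_r` is a hypothetical helper; the abstract does not state how the authors aggregated conditions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values (NOT from the paper): one point per rendering condition.
distances = [1.30, 1.26, 1.22, 1.19, 1.15, 1.12]  # text-image embedding distance
asr = [0.02, 0.10, 0.21, 0.30, 0.33, 0.41]        # attack success rate

r = pearson_r(distances, asr)
print(f"r = {r:.2f}")  # negative when smaller distances go with higher ASR
```

A negative `r` here only says that the two quantities move in opposite directions across conditions; it does not by itself separate alignment effects from plain legibility effects.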

What carries the argument

The argument rests on the text-image embedding distance computed by JinaCLIP and Qwen3-VL-Embedding. This distance quantifies how closely the rendered adversarial text aligns with its textual form, and it serves as a predictor of whether the VLM will execute the injected instruction.
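As an illustration of the quantity involved, the sketch below computes an L2 distance and a cosine distance between a text embedding and an image embedding. The vectors are toy four-dimensional stand-ins chosen by hand, not outputs of JinaCLIP or Qwen3-VL-Embedding, and the helper names are hypothetical. It assumes embeddings are unit-normalized before comparison, which the abstract does not confirm.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))

# Toy stand-ins (assumption): an embedding of the prompt text, and embeddings
# of the same prompt rendered as a legible image and as a degraded image.
text_emb = l2_normalize([0.9, 0.1, 0.3, 0.2])
image_emb_legible = l2_normalize([0.8, 0.2, 0.3, 0.2])
image_emb_degraded = l2_normalize([0.3, 0.7, 0.1, 0.6])

print("legible :", l2_distance(text_emb, image_emb_legible))
print("degraded:", l2_distance(text_emb, image_emb_degraded))
```

In a real measurement the three vectors would come from the embedding model's text and image encoders; everything downstream of that call is the same arithmetic.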

If this is right

  • Mid-range font sizes produce peak attack success while fonts at 6 pixels yield near-zero success across all tested models.
  • Text attacks achieve higher success than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%) but show comparable rates across modalities for Mistral and Qwen3-VL.
  • Heavy visual degradations raise embedding distance by 10-12 percent and cut attack success by 34-96 percent.
  • Rotation cuts attack success by 50 percent for Mistral but leaves GPT-4o unchanged, showing no single transformation protects every VLM.
  • Model-specific vulnerability patterns require tailored defenses rather than uniform solutions for agentic systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Embedding distance could be computed in advance to rank prompts by likely vulnerability before they reach deployed VLMs in browser or robotic agents.
  • Fine-tuning VLMs to increase embedding distance specifically for suspicious rendered text might reduce typographic attack success without harming general performance.
  • The same distance metric might apply to other multimodal attacks that rely on cross-modal alignment to bypass safety layers.
  • Agent builders could prefer VLMs that maintain larger embedding distances under typical visual noise as a selection criterion for safety-critical uses.
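The first extension above, ranking inputs by distance before they reach a deployed VLM, could be sketched as follows. This is a hypothetical triage step, not something the paper implements: `ScoredInput`, `rank_by_risk`, the input names, and the threshold value are all assumptions made for illustration.

```python
from typing import NamedTuple

class ScoredInput(NamedTuple):
    input_id: str
    distance: float  # text-image embedding distance from an external embedder

def rank_by_risk(items, review_threshold):
    """Order inputs from smallest to largest distance and flag the closest ones.

    Smaller distance is treated as higher predicted risk, following the
    negative correlation the paper reports. The threshold is a made-up knob.
    """
    ranked = sorted(items, key=lambda item: item.distance)
    flagged = [item for item in ranked if item.distance <= review_threshold]
    return ranked, flagged

# Placeholder inputs with invented distances.
items = [
    ScoredInput("screenshot-a", 1.31),
    ScoredInput("screenshot-b", 1.08),
    ScoredInput("screenshot-c", 1.19),
]
ranked, flagged = rank_by_risk(items, review_threshold=1.20)
for item in flagged:
    print(f"review {item.input_id} (distance {item.distance:.2f})")
```

Any usable threshold would have to be calibrated per embedder and per target VLM, since the paper stresses that vulnerability patterns are model-specific.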

Load-bearing premise

The two chosen embedding models accurately reflect the internal text-image alignment processes inside the target VLMs that determine whether the model follows the rendered adversarial text.

What would settle it

Re-measuring embedding distances with a third multimodal embedding model on the same set of 1,000 prompts and checking whether the correlation with attack success rate remains negative with a magnitude above 0.7 (r < -0.7) across the four VLMs.

Figures

Figures reproduced from arXiv: 2604.12371 by Ankit Garg, Ravikumar Balakrishnan, Sanket Mendapara.

Figure 1: Example typographic prompt injection attack using the TAP jailbreak method. (a) Harmful …
Figure 2: Font size analysis. (a) JinaCLIP L2 distance decreases by 14% from 6px to 28px. (b) ASR …
Figure 3: Transformation analysis. (a) JinaCLIP L2 distance by transformation (sorted); dashed …
original abstract

We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6–28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10–12% and reduce ASR by 34–96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript empirically studies typographic prompt injection attacks on VLMs by rendering adversarial text from 1,000 SALAD-Bench prompts as images and measuring attack success rates (ASR) on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct. It varies font sizes (6-28 px) and transformations (rotation, blur, noise, contrast), reports that font size and modality affect ASR differently across models, and finds strong negative correlations (r = -0.71 to -0.93, p < 0.01) between text-image embedding distances from JinaCLIP and Qwen3-VL-Embedding and ASR, plus effects of degradations on both distance and ASR.

Significance. If the correlations are shown to track internal alignment rather than shared visual sensitivities, the work supplies useful empirical guidance for VLM selection in adversarial agentic settings and highlights model-specific robustness patterns. The proxy use of external multimodal embedders for alignment is a potentially valuable measurement approach, but its validity is central to the contribution.

major comments (3)
  1. [Abstract] Abstract, finding (3): the strong negative correlations (r = -0.71 to -0.93) between external embedding distances and ASR are presented as evidence linking text-image alignment to attack success, yet no ablation holds visual legibility constant (e.g., by fixing font size and transformations while varying semantic content) while measuring distance; this leaves open that the r values may be driven by low-level image properties affecting readability for any VLM rather than model-internal fusion.
  2. [Abstract] Abstract: no comparison is reported against the target VLMs' own text or vision encoders (available for Qwen3-VL-4B-Instruct and potentially extractable for others) to test whether the chosen external proxies (JinaCLIP, Qwen3-VL-Embedding) better predict ASR than the models' native alignment metrics.
  3. [Abstract] Abstract: the correlation analysis aggregates across font sizes and transformations without reported details on the rendering pipeline, exact prompt sampling, aggregation method per condition, multiple-testing correction, or confidence intervals on the r values, preventing assessment of whether the reported statistical significance is robust.
minor comments (1)
  1. [Abstract] Abstract: the statement that 'heavy degradations increase embedding distance by 10–12%' would be clearer with the exact distance metric (cosine, Euclidean) and baseline values.
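The minor comment matters because the two metrics scale differently. As a worked identity, assuming the embeddings are unit-normalized (an assumption, since the abstract does not say), with u and v the text and image embeddings, θ the angle between them, and d_cos the cosine distance (notation introduced here for illustration):

```latex
\begin{aligned}
\lVert u - v \rVert_2^2
  &= \lVert u \rVert_2^2 + \lVert v \rVert_2^2 - 2\, u^\top v \\
  &= 2 - 2\cos\theta \qquad (\lVert u \rVert_2 = \lVert v \rVert_2 = 1) \\
  &= 2\, d_{\cos}(u, v), \qquad d_{\cos}(u, v) = 1 - \cos\theta .
\end{aligned}
```

Under that assumption cosine distance grows with the square of L2 distance, so a 10% rise in L2 distance corresponds to a (1.10)^2 - 1 = 21% rise in cosine distance. The figure captions mention JinaCLIP L2 distance, which suggests the percentages refer to L2, but the abstract leaves the metric unstated.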

Simulated Author's Rebuttal

3 responses · 1 unresolved

We appreciate the referee's thorough review and constructive feedback on our work. We address each of the major comments in detail below, indicating where revisions will be made to the manuscript.

point-by-point responses
  1. Referee: [Abstract] Abstract, finding (3): the strong negative correlations (r = -0.71 to -0.93) between external embedding distances and ASR are presented as evidence linking text-image alignment to attack success, yet no ablation holds visual legibility constant (e.g., by fixing font size and transformations while varying semantic content) while measuring distance; this leaves open that the r values may be driven by low-level image properties affecting readability for any VLM rather than model-internal fusion.

    Authors: We acknowledge that the absence of an ablation holding visual legibility constant while varying semantic content leaves room for alternative interpretations of the correlations. The current results show consistent negative correlations across varied conditions, which we believe supports a link to alignment, but to rigorously address this, we will incorporate an additional controlled experiment in the revised manuscript. Specifically, we will fix font size and transformations and analyze subsets of prompts with differing semantic properties to measure embedding distances and their relation to ASR. revision: yes

  2. Referee: [Abstract] Abstract: no comparison is reported against the target VLMs' own text or vision encoders (available for Qwen3-VL-4B-Instruct and potentially extractable for others) to test whether the chosen external proxies (JinaCLIP, Qwen3-VL-Embedding) better predict ASR than the models' native alignment metrics.

    Authors: We agree that comparing to native encoders would strengthen the validity of the proxy approach. For the open-source Qwen3-VL-4B-Instruct, we will compute alignment using its own text and vision encoders and compare the predictive power for ASR against the external embedders. For the closed-source models, internal access is unavailable, so we will note this limitation and focus the comparison on the available model. revision: partial

  3. Referee: [Abstract] Abstract: the correlation analysis aggregates across font sizes and transformations without reported details on the rendering pipeline, exact prompt sampling, aggregation method per condition, multiple-testing correction, or confidence intervals on the r values, preventing assessment of whether the reported statistical significance is robust.

    Authors: The full manuscript provides details on the rendering pipeline (using PIL with specific font settings), prompt sampling (all 1,000 SALAD-Bench prompts), and aggregation (mean ASR per condition). However, to improve the abstract and ensure robustness assessment, we will add confidence intervals for the correlation coefficients and clarify the statistical methods, including that p-values are reported without correction for multiple comparisons as the analysis is primarily descriptive across models. revision: yes
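The statistical additions discussed in the third response could look roughly like the sketch below: an approximate 95% confidence interval for a correlation coefficient via the Fisher z-transform, and a Bonferroni adjustment for several p-values. The sample size `n_conditions = 12` and the p-values are placeholders, since the abstract does not report how many conditions enter each correlation, and the helper names are hypothetical. Bonferroni is one common correction, not one the authors commit to.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate CI for a Pearson r from n paired observations.

    Uses the Fisher z-transform; z_crit defaults to the two-sided 95% value.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Placeholder inputs: r is the weakest value quoted in the abstract,
# n_conditions is an assumed count of rendering conditions.
n_conditions = 12
low, high = fisher_ci(-0.71, n_conditions)
print(f"95% CI for r = -0.71 with n = {n_conditions}: [{low:.2f}, {high:.2f}]")

# Invented p-values, only to show the adjustment.
print(bonferroni([0.001, 0.004, 0.009]))
```

With few conditions the interval around even a strong correlation is wide, which is why the referee's request for confidence intervals bears on how much weight the r values can carry.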

standing simulated objections not resolved
  • Direct comparison to native encoders is not possible for closed-source models (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3) due to inaccessible internals.

Circularity Check

0 steps flagged

No circularity: empirical correlation study with external measurements

full rationale

The paper presents an empirical evaluation of typographic attacks on four VLMs using 1,000 SALAD-Bench prompts under controlled visual variations. It directly measures attack success rates (ASR) and computes text-image embedding distances from two fixed external models (JinaCLIP and Qwen3-VL-Embedding), then reports Pearson correlations (r = -0.71 to -0.93). No equations, derivations, or predictions are defined that reduce to fitted parameters from the same data, self-citations, or ansatzes. The central claim is an observed statistical association in directly measured data; embedding models are independent of the target VLMs' internal encoders, and no load-bearing premise relies on prior author work. This is a standard observational study without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is purely empirical and relies on standard statistical assumptions for correlation testing rather than new theoretical constructs.

axioms (1)
  • [standard math] Pearson correlation assumes a linear relationship and approximate normality of the underlying variables
    Invoked when reporting r values and p < 0.01 across embedding distances and ASR.
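Because the axiom above concerns linearity, a rank-based check is a common complement: Spearman's rho only requires a monotonic relationship. The sketch below is a plain-Python illustration with invented inputs. The paper is not known to report Spearman correlations, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Invented, strongly nonlinear but monotonically decreasing relationship.
distances = [1.0, 2.0, 3.0, 4.0]
asr = [100.0, 10.0, 1.0, 0.1]
print("Pearson :", pearson_r(distances, asr))
print("Spearman:", spearman_rho(distances, asr))
```

If Pearson and Spearman disagree sharply on the real data, that would hint that the linearity assumption is doing more work than the reported r values suggest.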

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    Jina CLIP: Your CLIP model is also your text retriever

    Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martinez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, and Han Xiao. Jina CLIP: Your CLIP model is also your text retriever. arXiv preprint arXiv:2405.20204, 2024.

  2. [2]

    SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models

    Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3923–3954.

  3. [3]

    VLM-Guard: Safeguarding vision-language models via fulfilling safety alignment gap

    Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. VLM-Guard: Safeguarding vision-language models via fulfilling safety alignment gap. arXiv preprint arXiv:2502.10486.

  4. [4]

    Typographic attacks in a multi-image setting

    Xiaomeng Wang, Zhengyu Zhao, and Martha Larson. Typographic attacks in a multi-image setting. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics.

  5. [5]

    SCAM: A real-world typographic robustness evaluation for multimodal foundation models

    Justus Westerhoff, Erblina Purelku, Jakob Hackstein, Jonas Loos, Leo Pinetzki, Erik Rodner, and Lorenz Hufe. SCAM: A real-world typographic robustness evaluation for multimodal foundation models. arXiv preprint arXiv:2504.04893.

  6. [6]

    AdvCLIP: Downstream-agnostic adversarial examples in multimodal contrastive learning

    Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, and Hai Jin. AdvCLIP: Downstream-agnostic adversarial examples in multimodal contrastive learning. In Proceedings of the 31st ACM International Conference on Multimedia.

Published at ICLR 2026 Workshop on Agents in the Wild