High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

Jing Zhang; Jinhong Ni; Mengqi He; Shu Zou; Xin Shen; Xinyu Tian; Zhaoyuan Yang

arxiv: 2512.21815 · v3 · pith:KA7ICOTDnew · submitted 2025-12-26 · 💻 cs.CV · cs.LG

Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang

Pith reviewed 2026-05-16 19:26 UTC · model grok-4.3

keywords adversarial attacks · vision-language models · entropy · high-entropy tokens · autoregressive generation · transferability · unsafe outputs · sparse perturbations

The pith

A small share of high-entropy tokens during generation concentrates most adversarial influence in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models generate text autoregressively, and the entropy of the next-token distribution at each step measures how uncertain the model is. The paper shows that the roughly 20 percent of decoding positions with the highest entropy account for a disproportionate share of the effect when an adversary perturbs the input image. Directing perturbations only at those positions produces semantic degradation comparable to attacks that optimize every position. Such high-entropy positions recur across models with different architectures, so the resulting attacks show non-trivial transferability. Under the paper's evaluation pipeline, 20-31 percent of the outputs induced by these targeted attacks are unsafe, which motivates the authors' Entropy-Guided Attack and its reusable token bank.

Core claim

In representative open-source vision-language models, a small fraction of tokens (around 20 percent) with the highest entropy during autoregressive decoding concentrates the bulk of adversarial influence from image perturbations. Concentrating optimization on these positions achieves semantic degradation comparable to global attacks while updating fewer tokens. The vulnerable high-entropy tokens appear consistently across architecturally diverse models, supporting non-trivial attack transferability. The same targeted perturbations also generate a substantial unsafe output subset (20-31 percent). The authors operationalize the finding in an Entropy-Guided Attack that reaches a 93-95 percent success rate.

What carries the argument

High-entropy tokens identified on the fly during autoregressive decoding, where entropy quantifies per-step model uncertainty and serves as a sparse target for image perturbations.
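As a rough illustration of the mechanism described above, the following Python sketch computes the per-step entropy H_t = −∑ p_t(w) log p_t(w) and selects the highest-entropy decoding positions. The function names, the `fraction` parameter, and the toy distributions are assumptions made for illustration; the paper's actual selection procedure may differ.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_positions(step_probs, fraction=0.2):
    """Return the indices of the highest-entropy steps and all entropies.

    `fraction` is a hypothetical knob standing in for the ~20% reported
    in the paper.
    """
    entropies = [token_entropy(p) for p in step_probs]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda t: entropies[t], reverse=True)
    return sorted(ranked[:k]), entropies

# Toy next-token distributions over a 4-word vocabulary, 5 decoding steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximal entropy, log(4)
    [0.70, 0.10, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
selected, entropies = top_entropy_positions(steps, fraction=0.2)
print(selected)
```

With five steps and a fraction of 0.2, exactly one position is kept, and it is the step with the uniform distribution.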

If this is right

Sparse attacks on 20 percent of tokens match the semantic effect of full-position attacks.
High-entropy positions recur across diverse VLMs, enabling transfer without per-model retuning.
Targeted attacks raise unsafe output rates to 20-31 percent under standard evaluation pipelines.
An Entropy-Guided Attack using a reusable token bank reaches 93-95 percent success with a 30.2-38.6 percent harmful rate on three open-source VLMs.
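The contrast between a full-position and a sparse objective in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `entropy_objective` and `top_positions` are hypothetical helpers, the entropy values are made up, and a real attack would differentiate such an objective with respect to the image pixels rather than simply evaluate it.

```python
import math

def entropy_objective(entropies, positions=None):
    """Sum per-step entropies over `positions`; None means every step."""
    if positions is None:
        positions = range(len(entropies))
    return sum(entropies[t] for t in positions)

def top_positions(entropies, fraction=0.2):
    """Indices of the top `fraction` of steps ranked by entropy."""
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda t: entropies[t], reverse=True)
    return sorted(ranked[:k])

# Made-up per-step entropies for a 10-token output (not from the paper).
toy_entropies = [0.1, 1.3, 0.2, 0.9, 0.1, 1.1, 0.3, 0.2, 0.1, 0.4]

sparse_positions = top_positions(toy_entropies)
global_loss = entropy_objective(toy_entropies)                    # all 10 steps
sparse_loss = entropy_objective(toy_entropies, sparse_positions)  # 2 steps

print(sparse_positions, f"{len(sparse_positions)} of {len(toy_entropies)} steps")
```

The only difference between the two objectives is the set of positions summed over, which is what lets the sparse variant optimize fewer decoding positions.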

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses could monitor entropy spikes during generation to flag potential adversarial inputs.
Similar concentration patterns may appear in other autoregressive multimodal systems beyond VLMs.
Reusable token banks could be adapted for efficient red-teaming or for studying failure modes in new model releases.
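The first extension above, monitoring entropy during generation, could look roughly like the sketch below. It is an assumption-laden illustration of the idea rather than a tested defense: the function name `flag_entropy_spikes` and both thresholds are hypothetical and would need calibration on clean generations.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_entropy_spikes(step_probs, spike_threshold=1.0, max_spike_fraction=0.3):
    """Flag a generation when too many steps exceed an entropy threshold.

    Both thresholds are hypothetical placeholders, not calibrated values.
    """
    entropies = [token_entropy(p) for p in step_probs]
    spikes = [t for t, h in enumerate(entropies) if h > spike_threshold]
    flagged = len(spikes) > max_spike_fraction * len(entropies)
    return flagged, spikes

# Toy generations over a 4-word vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]] * 4
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3 + [[0.97, 0.01, 0.01, 0.01]]

print(flag_entropy_spikes(confident))
print(flag_entropy_spikes(uncertain))
```

A monitor of this kind only sees symptoms, so a high flag rate on clean but genuinely ambiguous inputs would be an expected cost.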

Load-bearing premise

High-entropy tokens can be identified reliably during generation, and concentrating perturbations on them produces comparable degradation without post-hoc selection or model-specific tuning that would break transferability.

What would settle it

A controlled test in which attacks restricted to low-entropy tokens achieve equal or higher semantic degradation than attacks on high-entropy tokens, or in which the identity of the top 20 percent entropy positions shifts substantially across model runs or architectures.
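The second half of that test, whether the top 20 percent positions stay put across runs or architectures, can be quantified with a set-overlap score. The sketch below uses Jaccard overlap as one possible measure; this choice, the helper names, and the entropy values are assumptions for illustration, and it presumes the two entropy profiles are aligned position by position on the same output.

```python
import math

def top_fraction(entropies, fraction=0.2):
    """Set of step indices in the top `fraction` by entropy."""
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda t: entropies[t], reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard overlap |a ∩ b| / |a ∪ b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Made-up entropy profiles for two hypothetical models on the same caption.
model_a = [0.1, 1.3, 0.2, 0.9, 0.1, 1.1, 0.3, 0.2, 0.1, 0.4]
model_b = [0.2, 1.2, 0.1, 0.3, 0.2, 0.8, 0.9, 0.1, 0.2, 0.3]

top_a = top_fraction(model_a)
top_b = top_fraction(model_b)
overlap = jaccard(top_a, top_b)
print(sorted(top_a), sorted(top_b), round(overlap, 3))
```

An overlap near 1 across many prompts would support the recurrence claim, while values near the chance level for the chosen fraction would count against it.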

Figures

Figures reproduced from arXiv: 2512.21815 by Jing Zhang, Jinhong Ni, Mengqi He, Shu Zou, Xin Shen, Xinyu Tian, Zhaoyuan Yang.

**Figure 2.** The ∆CIDEr distribution w.r.t. the selected top p% high-entropy tokens, showcasing that 20% is sufficient.

**Figure 3.** Harmful Pie Chart. Nested pies for captioning on three VLMs (left…

**Figure 5.** Harmful Rate with different image condition while keep…

**Figure 6.** Transferability performance w.r.t. (left) CIDEr drop and…

**Figure 8.** Ablation on entropy selection. Both Qwen and InternVL…

**Figure 9.** (a) shows the current token flip rate distribution vs the…

**Figure 10.** The examples of attack of seven categories, including Illegal Activity, Violence, Hate, Self-Harm, Privacy, Sexual, and Other…

**Figure 11.** Route-wise attribution of harmful behavior. (a) ASR-LLM by route. (b) Harmful-rate uplift by route. We report results for three captioning VLMs across the seven image/prefix routes in our switching experiment: fully adversarial (Adv); adversarial prefix with clean, white, or no image (Img clean, Img white, Img none); adversarial image with clean or sanitized prefix (Pref clean, Pref san); and the clean ba…

Original abstract

Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

High-entropy tokens concentrate most adversarial vulnerability in VLMs and support a reusable bank for transferable attacks.

The main thing here is that adversarial influence in these VLMs is not uniform across decoding steps. A small slice of high-entropy tokens, around 20 percent, carries a disproportionate share of the effect, and the same positions recur across different architectures. Targeting just those positions gives comparable semantic degradation to attacking every token while using fewer changes, and it also drives 20-31 percent of outputs into unsafe territory. The reusable token bank is the practical piece that lets them claim non-trivial transfer between models without retraining the attack each time. They report 93-95 percent success rates on three open VLMs with 30-38 percent harmful outputs, which is concrete enough to notice. The shift from blanket entropy maximization to sparse targeting is the actual extension beyond prior work. The abstract states the pattern clearly and shows the bank works in their setup.

The soft spots are the missing controls. No error bars, no ablation on the exact 20 percent cutoff, and no full protocol for how the entropy map is built or how the bank is populated. The stress-test concern about post-hoc selection on clean runs is worth checking: if the high-entropy positions are identified from a preliminary pass on the target model, that adds model-specific information and could weaken pure black-box transfer even with the bank. The paper claims recurrence helps, but the details on whether the bank avoids that dependency are not in the abstract.

This is for people working on VLM robustness and safety interventions. The empirical concentration pattern is worth verifying with proper controls, so it deserves a serious referee rather than a desk reject.

Referee Report

4 major / 2 minor

Summary. The paper claims that approximately 20% of high-entropy tokens during autoregressive generation in vision-language models concentrate a disproportionate share of adversarial influence. It introduces an Entropy-Guided Attack (EGA) that targets these sparse positions, augmented by a reusable token bank, achieving 93-95% attack success rates and 30.2-38.6% harmful output rates on three representative open-source VLMs while demonstrating non-trivial cross-model transferability.

Significance. If the empirical results hold under rigorous verification, the work identifies a sparse, recurring set of failure points that could enable more efficient and transferable adversarial attacks on VLMs. The concrete attack success numbers and the observation of elevated unsafe outputs (20-31%) provide actionable insights for multimodal safety research, particularly if the reusable token bank proves model-agnostic.

major comments (4)

[Transferability Experiments] Transferability section: The reusable token bank is presented as enabling cross-VLM attacks, but the manuscript does not explicitly demonstrate that high-entropy position identification avoids a preliminary clean forward pass on the target model. If selection requires target-specific entropy maps, this post-hoc conditioning undermines the black-box transferability claim and the assertion that attacks remain effective without model-specific tuning.
[Results] Results section: Key statistics (e.g., ~20% high-entropy fraction, 93-95% ASR, 20-31% unsafe subset) are reported as consistent across VLMs without error bars, standard deviations, trial counts, or random seed details. This absence makes it impossible to evaluate whether the disproportionate influence claim is statistically robust or sensitive to sampling variation.
[Method] Method section: The protocol for computing per-step entropy and selecting the top 20% positions during generation lacks sufficient detail on whether the process is fully online (without clean-run precomputation) or requires model-specific access. This detail is load-bearing for the central claim that high-entropy targeting generalizes without post-hoc selection.
[Ablation Studies] Ablation studies: No ablation is provided on the 20% threshold itself or comparisons against other fractions (e.g., 10% or 30%), which is necessary to substantiate that this specific fraction uniquely concentrates adversarial influence rather than being an arbitrary cutoff.

minor comments (2)

[Abstract] The abstract and introduction should explicitly list the three evaluated VLMs with their architectures and parameter scales to ground the 'diverse architectures' claim.
[Figures] Figure captions for entropy distributions and attack visualizations should include the exact number of prompts or images used and any preprocessing steps.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We appreciate the recognition of the potential impact of identifying high-entropy tokens as multimodal failure points. Below, we provide point-by-point responses to the major comments and indicate the revisions we will make to address them.

Referee: [Transferability Experiments] Transferability section: The reusable token bank is presented as enabling cross-VLM attacks, but the manuscript does not explicitly demonstrate that high-entropy position identification avoids a preliminary clean forward pass on the target model. If selection requires target-specific entropy maps, this post-hoc conditioning undermines the black-box transferability claim and the assertion that attacks remain effective without model-specific tuning.

Authors: We clarify that in the transferability experiments, the high-entropy positions and the reusable token bank are derived exclusively from the source model. The perturbations are then applied to the target model using the pre-identified positions and tokens from the bank, without performing any clean forward pass or entropy computation on the target model. This maintains the black-box setting. However, we agree that the manuscript lacks explicit description of this protocol. We will revise the Transferability section to detail this process, including pseudocode or a diagram if necessary, to substantiate the claim. revision: yes
Referee: [Results] Results section: Key statistics (e.g., ~20% high-entropy fraction, 93-95% ASR, 20-31% unsafe subset) are reported as consistent across VLMs without error bars, standard deviations, trial counts, or random seed details. This absence makes it impossible to evaluate whether the disproportionate influence claim is statistically robust or sensitive to sampling variation.

Authors: The referee is correct that the current manuscript does not include error bars or details on the number of trials and random seeds. We will update the Results section to report all key metrics with mean and standard deviation over multiple runs (e.g., 5 independent trials with different seeds), and specify the random seeds used for reproducibility. This will allow better assessment of the robustness of the findings. revision: yes
Referee: [Method] Method section: The protocol for computing per-step entropy and selecting the top 20% positions during generation lacks sufficient detail on whether the process is fully online (without clean-run precomputation) or requires model-specific access. This detail is load-bearing for the central claim that high-entropy targeting generalizes without post-hoc selection.

Authors: We acknowledge the need for greater clarity in the Method section. The entropy computation is performed online during the autoregressive generation process on the model being attacked, using the model's own output probabilities at each step to calculate entropy and select the top 20% positions dynamically. No precomputation on a clean run is required beyond the standard generation. We will expand the Method section with a step-by-step description and pseudocode to make this explicit, confirming that it does not rely on model-specific pre-access beyond what is needed for the attack. revision: yes
Referee: [Ablation Studies] Ablation studies: No ablation is provided on the 20% threshold itself or comparisons against other fractions (e.g., 10% or 30%), which is necessary to substantiate that this specific fraction uniquely concentrates adversarial influence rather than being an arbitrary cutoff.

Authors: We agree that an ablation on the threshold would strengthen the paper. We will add a new ablation study in the Experiments section, evaluating attack performance at different high-entropy fractions (10%, 15%, 20%, 25%, 30%) across the VLMs. This will demonstrate that 20% provides an effective balance between sparsity and attack efficacy, supporting our choice. revision: yes
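The two reporting changes promised in the rebuttal, mean and standard deviation over several seeded trials and a sweep over high-entropy fractions, could be organized roughly as in the sketch below. Everything here is illustrative: `sweep`, `summarize`, and `placeholder_trial` are hypothetical names, and the trial function returns made-up numbers in place of a real attack run, so the printed values are not results from the paper.

```python
import statistics

def summarize(values):
    """Mean and sample standard deviation of one metric across trials."""
    return statistics.mean(values), statistics.stdev(values)

def sweep(run_trial, fractions, seeds):
    """Call `run_trial(fraction, seed)` over a grid; summarize per fraction."""
    return {f: summarize([run_trial(f, s) for s in seeds]) for f in fractions}

def placeholder_trial(fraction, seed):
    # Stand-in for a real attack evaluation. The returned score is
    # fabricated purely so that the sketch runs end to end.
    return fraction + 0.01 * seed

report = sweep(
    placeholder_trial,
    fractions=[0.10, 0.15, 0.20, 0.25, 0.30],  # fractions named in the rebuttal
    seeds=range(5),                            # 5 independent trials
)
for fraction, (mean, std) in report.items():
    print(f"top {fraction:.0%}: {mean:.3f} ± {std:.3f}")
```

Swapping `placeholder_trial` for a function that runs the attack and returns a metric such as CIDEr drop or attack success rate would give the table the referee asks for.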

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements

The paper reports observational findings from entropy computations and attack experiments on multiple VLMs. High-entropy token identification occurs via per-step entropy calculation during autoregressive generation, followed by targeted perturbation testing; success rates and transferability are measured outcomes, not quantities derived by construction from fitted parameters or prior self-citations. No load-bearing step reduces to self-definition, renaming, or an imported uniqueness theorem. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the standard assumption that entropy reliably tracks adversarial vulnerability and on empirical observation rather than new free parameters, axioms, or invented entities.

axioms (1)

domain assumption: Entropy serves as a valid proxy for model uncertainty that correlates with adversarial vulnerability. Invoked to justify identifying high-entropy tokens as the primary attack targets.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we reveal that a small fraction (around 20%) of high-entropy tokens... concentrates a disproportionate share of adversarial influence... Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting"

IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · unclear
Paper passage: "token entropy is thus defined as H_t(p_t(w)) = −∑ p_t(w) log p_t(w)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
cs.CV · 2026-05 · unverdicted · novelty 6.0

UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.