Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3
The pith
Maximizing entropy at high-entropy refusal tokens flips VLM outputs to harmful content without fixed targets and improves cross-model transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Maximizing entropy at these decision tokens via UJEM-KL flips refusal outcomes while preserving output quality at low-entropy positions, yielding competitive white-box attack success rates and consistently higher transferability across models than prior constrained methods.
What carries the argument
UJEM-KL, an untargeted attack that maximizes entropy (via KL divergence) selectively at high-entropy decision tokens while constraining low-entropy positions.
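A minimal sketch of what such a selective objective could look like, in plain Python. This is not the paper's loss: the threshold `tau`, the weight `lam`, the function name `selective_entropy_loss`, and the choice of a KL-to-clean term for stabilization are all assumptions made for illustration. The one identity it relies on is standard: for a vocabulary of size V, KL(p || uniform) = log V - H(p), so minimizing KL to the uniform distribution is the same as maximizing entropy.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q, eps=1e-12):
    # KL(p || q) in nats; q is clamped to avoid division by zero.
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)

def selective_entropy_loss(clean_logits, adv_logits, tau=1.0, lam=1.0):
    """Hypothetical loss to MINIMIZE over the adversarial image.

    clean_logits / adv_logits: per-position logit lists for the same
    response, without and with the perturbation.
    tau (assumed): entropy threshold separating decision tokens.
    lam (assumed): weight of the stabilization term.
    """
    loss = 0.0
    for c, a in zip(clean_logits, adv_logits):
        p_clean, p_adv = softmax(c), softmax(a)
        if entropy(p_clean) >= tau:
            # High-entropy decision token: push toward uniform,
            # which is equivalent to maximizing entropy.
            uniform = [1.0 / len(p_adv)] * len(p_adv)
            loss += kl(p_adv, uniform)
        else:
            # Low-entropy token: keep the distribution near the clean one.
            loss += lam * kl(p_clean, p_adv)
    return loss

if __name__ == "__main__":
    clean = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
    print(selective_entropy_loss(clean, clean))
```

In a real attack the logits would come from a VLM forward pass and the loss would be differentiated with respect to the image; both are omitted here.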
If this is right
- The method achieves competitive white-box attack success rates on three VLMs across two safety benchmarks.
- Transferability improves consistently compared with prior gradient-based universal image jailbreaks.
- The attack remains effective under representative defenses.
- Limited transferability in earlier work stems primarily from overly constrained optimization objectives rather than inherent model robustness.
Where Pith is reading between the lines
- Safety mechanisms in autoregressive VLMs may be more localized to high-entropy decision points than previously modeled.
- Defenses could be strengthened by explicitly penalizing entropy spikes at candidate refusal tokens during decoding.
- The same selective entropy approach might apply to other autoregressive multimodal or language-only models without requiring target strings.
Load-bearing premise
Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack.
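The two quantities in this premise can be made concrete with a short sketch. The toy vocabulary, the probabilities, the set of refusal token ids, and the function names below are all hypothetical placeholders chosen for illustration, not values or code from the paper.

```python
import math

def token_entropy(p):
    # Shannon entropy (nats) of one decoding step's next-token distribution.
    return -sum(q * math.log(q) for q in p if q > 0)

def non_refusal_topk_mass(p, refusal_ids, k=5):
    # Probability mass held by non-refusal tokens among the k most likely tokens.
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    return sum(p[i] for i in top if i not in refusal_ids)

# Toy step: vocabulary and probabilities are made up for illustration.
vocab = ["Sorry", "Sure", "Cannot", "Here", "The"]
probs = [0.40, 0.30, 0.05, 0.20, 0.05]
refusal_ids = {0, 2}  # assumed ids of refusal-style tokens

step_entropy = token_entropy(probs)
step_mass = non_refusal_topk_mass(probs, refusal_ids, k=3)
print(step_entropy, step_mass)
```

Under this reading, a step counts as a decision token when its entropy is high, and the premise holds when the non-refusal mass at such steps is already sizable before any perturbation.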
What would settle it
The claim would be undermined if an attack that instead maximizes entropy only at low-entropy tokens produced equal or higher transferability, or if UJEM-KL showed no transfer improvement over prior constrained objectives on the same models.
Original abstract
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that limited cross-model transferability in prior gradient-based universal image jailbreaks on VLMs stems primarily from overly constrained optimization objectives (e.g., fixed prefixes or response patterns). A preliminary experiment is cited showing that refusal concentrates at high-entropy tokens during autoregressive decoding while non-refusal tokens already hold substantial top-ranked probability mass; this motivates the proposed UJEM-KL attack, which maximizes entropy at selected decision tokens and stabilizes low-entropy positions. Across three VLMs and two benchmarks, the method reportedly achieves competitive white-box success rates and improved transferability, and remains effective under defenses.
Significance. If the entropy-concentration observation is robust and the transfer gains are reproducible with proper controls, the work would meaningfully advance VLM adversarial robustness by showing that relaxing objective constraints via entropy maximization can yield better transfer without target patterns. The lightweight nature of the attack and its reported resilience to defenses would be useful for safety evaluation, but the current lack of quantitative baselines, ablations, and statistical details limits the strength of the contribution.
major comments (2)
- [Abstract] Abstract: the attribution of poor transferability to 'overly constrained optimization objectives' is load-bearing for the paper's motivation and conclusion, yet rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, or probability-mass comparisons) are not reported, leaving the key assumption unverified.
- [Abstract] Abstract / Experimental Results: the claims of 'competitive white-box attack success rates' and 'consistently improves transferability' are presented without any numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.
minor comments (1)
- [Abstract] The acronym UJEM-KL is introduced in the abstract without expansion or a forward reference to its definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.
- Referee: [Abstract] Abstract: the attribution of poor transferability to 'overly constrained optimization objectives' is load-bearing for the paper's motivation and conclusion, yet rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, or probability-mass comparisons) are not reported, leaving the key assumption unverified.
  Authors: We agree that the abstract should make the preliminary experiment verifiable. The full manuscript (Section 3.1) reports the experiment using 100 prompts sampled from the two safety benchmarks across the three VLMs, with quantitative entropy distributions and top-k probability mass comparisons provided in Figure 2 and the associated text. These show refusal concentrating at higher-entropy positions while non-refusal tokens already occupy substantial mass in the top candidates. We will revise the abstract to concisely include the prompt count, models, and key quantitative observations, with a pointer to Section 3.1 and Figure 2.
  revision: yes
- Referee: [Abstract] Abstract / Experimental Results: the claims of 'competitive white-box attack success rates' and 'consistently improves transferability' are presented without any numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.
  Authors: The abstract is intentionally high-level, but the full manuscript (Section 4 and Tables 1-4) provides the requested quantitative details: white-box ASR values competitive with prior methods, transferability gains across models, direct baseline comparisons (including fixed-target gradient attacks), ablation studies isolating the entropy-maximization term, and statistical significance via repeated runs with t-tests. We will update the abstract to report representative numerical results and explicitly reference the ablations and controls so readers can immediately evaluate the contribution of the entropy-maximization component.
  revision: yes
Circularity Check
No significant circularity in derivation chain
The paper motivates UJEM-KL from a preliminary empirical observation that refusal concentrates at high-entropy tokens and then demonstrates improved transferability via experiments on three VLMs. No equation or result is shown to reduce by construction to a fitted parameter, self-citation, or renamed input; the attribution of limited transferability to constrained objectives follows from comparative attack success rates rather than tautological redefinition. This is a standard empirical workflow with no load-bearing circular step.