pith. machine review for the scientific record.

arxiv: 2605.10764 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization


Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords jailbreak · vision-language models · entropy maximization · transferability · untargeted attack · autoregressive decoding · multimodal safety · gradient-based attack

The pith

Maximizing entropy at high-entropy refusal tokens flips VLM outputs to harmful content without fixed targets and improves cross-model transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous gradient-based image jailbreaks on vision-language models showed almost no transfer across models, which many interpreted as evidence that transferable multimodal attacks are hard. The authors trace this to optimization objectives that force a specific prefix or response pattern. Their key observation is that refusal decisions cluster at high-entropy tokens during generation, while non-refusal tokens already hold meaningful probability mass. They therefore introduce an untargeted attack that raises entropy only at those decision points and stabilizes low-entropy positions to keep output coherent. Experiments on three VLMs and two safety benchmarks show competitive white-box success plus markedly better transfer, including against several defenses.

Core claim

Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Maximizing entropy at these decision tokens via UJEM-KL flips refusal outcomes while preserving output quality at low-entropy positions, yielding competitive white-box attack success rates and consistently higher transferability across models than prior constrained methods.

What carries the argument

UJEM-KL, an untargeted attack that maximizes entropy (via KL divergence) selectively at high-entropy decision tokens while constraining low-entropy positions.
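
The paper's objective is summarized above but not reproduced here; the following is a minimal PyTorch-style sketch of what a selective entropy objective of this shape could look like. Everything in it is an illustrative assumption rather than the authors' implementation: the threshold `tau`, the use of clean-model entropy to select decision positions, the KL-to-uniform form of entropy maximization, and all names.

```python
import torch
import torch.nn.functional as F

def ujem_kl_loss(adv_logits, clean_logits, tau=2.0, flip_mask=False):
    """Hypothetical sketch of a selective entropy-maximization objective.

    adv_logits, clean_logits: (seq_len, vocab_size) next-token logits for the
    same prompt under the adversarial and the clean image. tau is an assumed
    entropy threshold separating high-entropy "decision" positions from
    low-entropy "stable" ones; flip_mask inverts the selection for controls.
    """
    adv_logp = F.log_softmax(adv_logits, dim=-1)
    clean_logp = F.log_softmax(clean_logits, dim=-1)
    clean_p = clean_logp.exp()

    # Per-position entropy of the clean distribution picks out decision tokens.
    entropy = -(clean_p * clean_logp).sum(dim=-1)            # (seq_len,)
    decision = entropy > tau
    if flip_mask:
        decision = ~decision

    # At decision positions, pull the adversarial distribution toward uniform:
    # minimizing KL(uniform || p_adv) maximizes the entropy of p_adv.
    vocab_size = adv_logits.shape[-1]
    uniform = torch.full_like(clean_p, 1.0 / vocab_size)
    kl_to_uniform = (uniform * (uniform.log() - adv_logp)).sum(dim=-1)

    # At stable positions, anchor the adversarial distribution to the clean
    # one, which is what would keep the rest of the output coherent.
    kl_to_clean = (clean_p * (clean_logp - adv_logp)).sum(dim=-1)

    return torch.where(decision, kl_to_uniform, kl_to_clean).mean()
```

An attacker would minimize this loss with respect to the image perturbation, e.g. with PGD-style updates under a norm budget; whether the paper selects positions by clean-model or current-model entropy, and how it weights the two terms, are details this sketch deliberately leaves open.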

If this is right

  • The method achieves competitive white-box attack success rates on three VLMs across two safety benchmarks.
  • Transferability improves consistently compared with prior gradient-based universal image jailbreaks.
  • The attack remains effective under representative defenses.
  • Limited transferability in earlier work stems primarily from overly constrained optimization objectives rather than inherent model robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety mechanisms in autoregressive VLMs may be more localized to high-entropy decision points than previously modeled.
  • Defenses could be strengthened by explicitly penalizing entropy spikes at candidate refusal tokens during decoding; a sketch of such a monitor follows this list.
  • The same selective entropy approach might apply to other autoregressive multimodal or language-only models without requiring target strings.
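
To make the defense extension concrete, here is a minimal, hypothetical sketch of a decoding-time monitor: it flags a step whose next-token entropy spikes while probability mass on refusal markers collapses. The thresholds, the `refusal_ids` set, and the function name are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def refusal_entropy_guard(step_logits, refusal_ids,
                          entropy_threshold=2.0, mass_threshold=0.3):
    """Flag a suspicious decoding step (hypothetical defense sketch).

    step_logits: (vocab_size,) next-token logits at the current step.
    refusal_ids: token ids of refusal markers such as "sorry" or "cannot".
    """
    probs = F.softmax(step_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    refusal_mass = probs[refusal_ids].sum()
    # A high-entropy step whose refusal mass has been suppressed matches the
    # signature the attack is said to exploit.
    return bool(entropy > entropy_threshold) and bool(refusal_mass < mass_threshold)
```

A deployment could pause or re-sample generation whenever the guard fires; tuning the two thresholds against false positives on benign prompts would be the hard part.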

Load-bearing premise

Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack.
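
This premise is directly measurable. A hedged sketch of one way to check it, assuming a HuggingFace-style causal LM interface and a hand-picked refusal-token set (both assumptions, as is the helper name):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def refusal_entropy_profile(model, input_ids, refusal_ids):
    """Pair each generated token with the entropy of the distribution it was
    drawn from, marking refusal tokens (hypothetical measurement helper).

    input_ids: (1, seq_len) prompt plus the model's own generated response.
    refusal_ids: set of token ids for refusal markers ("sorry", "cannot", ...).
    """
    logits = model(input_ids).logits[0]          # (seq_len, vocab_size)
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # (seq_len,)

    records = []
    for pos in range(input_ids.shape[1] - 1):
        emitted = input_ids[0, pos + 1].item()   # token produced at this step
        records.append((pos, entropy[pos].item(), emitted in refusal_ids))
    return records
```

If the premise holds, the positions marked as refusal tokens should cluster in the upper tail of the entropy values; if they do not, the attack's selective targeting loses its rationale.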

What would settle it

If an attack that instead maximizes entropy only at low-entropy tokens produces equal or higher transferability, or if UJEM-KL shows no transfer improvement over prior constrained objectives on the same models.
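
Using the hypothetical `ujem_kl_loss` sketch above, the first of these controls amounts to nothing more than inverting the position mask; comparable transfer from this variant would undercut the premise.

```python
# Control experiment: raise entropy only at LOW-entropy positions instead,
# by inverting the mask in the ujem_kl_loss sketch above (flip_mask is an
# illustrative flag from that sketch, not a switch described in the paper).
control_loss = ujem_kl_loss(adv_logits, clean_logits, tau=2.0, flip_mask=True)
```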

Figures

Figures reproduced from arXiv: 2605.10764 by Jing Zhang, Jinhong Ni, Mengqi He, Shu Zou, Weikang Li, Xin Shen, Xinyu Tian, Xuesong Li, Zhaoyuan Yang.

Figure 1: Comparison between our untargeted multimodal jailbreak (right)…

Figure 2: Attack success rate (ASR) under different objective-relaxed settings. As discussed above, overly constrained optimization objectives may be a key reason for the low transferability of VLM jailbreak attacks. We therefore consider a simple optimization-free baseline that weakens safety alignment by manipulating decoding configurations, e.g., increasing the temperature [11], and show the attack success rat…

Figure 3: Top-10 high-entropy token shifts after perturbation. Observation 1: non-refusal tokens exist inherently. Safety-aligned LLMs are trained to refuse unsafe requests. When encountering harmful prompts, models often start responses with characteristic refusal phrases such as "I'm sorry". A non-refusal token is any token that does not belong to the refusal token set, indicating the model did not trigger i…

Figure 4: Refusal mass at different tokens. Observation 2: refusal concentrates at high-entropy decision points. Across multiple VLMs, refusal-indicative tokens (e.g., "sorry", "cannot") tend to appear at high-entropy positions…

Figure 5: Judge sensitivity under untargeted jailbreak evaluation.

Figure 6: Case study. Top: for the same clean image and the unsafe instruction, both Qwen-VL and LLaVA refuse on clean inputs. We then compare adversarial images crafted by a prior optimization-based baseline (SEA [31]) and our method (UJEM-KL). On each model, both attacks trigger unsafe response behavior. Bottom: cross-model transfer from Qwen→LLaVA using adversarial images optimized on Qwen-VL. SEA [31] fails to c…

Appendix Figure 1: Additional qualitative cases for Observation 3 and judge disagree…
Original abstract

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that limited cross-model transferability in prior gradient-based universal image jailbreaks on VLMs stems primarily from overly constrained optimization objectives (e.g., fixed prefixes or response patterns). A preliminary experiment is cited showing that refusal concentrates at high-entropy tokens during autoregressive decoding while non-refusal tokens already hold substantial top-ranked probability mass; this motivates the proposed UJEM-KL attack, which maximizes entropy at selected decision tokens and stabilizes low-entropy positions. Across three VLMs and two benchmarks the method reportedly achieves competitive white-box success rates, improved transferability, and remains effective under defenses.

Significance. If the entropy-concentration observation is robust and the transfer gains are reproducible with proper controls, the work would meaningfully advance VLM adversarial robustness by showing that relaxing objective constraints via entropy maximization can yield better transfer without target patterns. The lightweight nature of the attack and its reported resilience to defenses would be useful for safety evaluation, but the current lack of quantitative baselines, ablations, and statistical details limits the strength of the contribution.

major comments (2)
  1. [Abstract] The attribution of poor transferability to "overly constrained optimization objectives" is load-bearing for the paper's motivation and conclusion, yet it rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, probability-mass comparisons) are not reported, leaving the key assumption unverified.
  2. [Abstract / Experimental results] The claims of "competitive white-box attack success rates" and "consistently improves transferability" are presented without numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.
minor comments (1)
  1. [Abstract] The acronym UJEM-KL is introduced in the abstract without expansion or a forward reference to its definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The attribution of poor transferability to "overly constrained optimization objectives" is load-bearing for the paper's motivation and conclusion, yet it rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, probability-mass comparisons) are not reported, leaving the key assumption unverified.

    Authors: We agree that the abstract should make the preliminary experiment verifiable. The full manuscript (Section 3.1) reports the experiment using 100 prompts sampled from the two safety benchmarks across the three VLMs, with quantitative entropy distributions and top-k probability mass comparisons provided in Figure 2 and the associated text. These show refusal concentrating at higher-entropy positions while non-refusal tokens already occupy substantial mass in the top candidates. We will revise the abstract to concisely include the prompt count, models, and key quantitative observations, with a pointer to Section 3.1 and Figure 2. revision: yes

  2. Referee: [Abstract / Experimental results] The claims of "competitive white-box attack success rates" and "consistently improves transferability" are presented without numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.

    Authors: The abstract is intentionally high-level, but the full manuscript (Section 4 and Tables 1-4) provides the requested quantitative details: white-box ASR values competitive with prior methods, transferability gains across models, direct baseline comparisons (including fixed-target gradient attacks), ablation studies isolating the entropy-maximization term, and statistical significance via repeated runs with t-tests. We will update the abstract to report representative numerical results and explicitly reference the ablations and controls so readers can immediately evaluate the contribution of the entropy-maximization component. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Rationale

The paper motivates UJEM-KL from a preliminary empirical observation that refusal concentrates at high-entropy tokens and then demonstrates improved transferability via experiments on three VLMs. No equation or result is shown to reduce by construction to a fitted parameter, self-citation, or renamed input; the attribution of limited transferability to constrained objectives follows from comparative attack success rates rather than tautological redefinition. This is a standard empirical workflow with no load-bearing circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on an empirical observation about entropy concentration in autoregressive decoding; no explicit free parameters, mathematical axioms, or new postulated entities are introduced in the provided abstract.

pith-pipeline@v0.9.0 · 5500 in / 946 out tokens · 38549 ms · 2026-05-12T03:32:55.001710+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 10 internal anchors

  1. Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a visual language model for few-shot learning. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) NeurIPS (2022)
  2. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report. CoRR abs/2502.13923 (2025)
  3. Carlini, N., Nasr, M., Choquette-Choo, C.A., Jagielski, M., Gao, I., Koh, P.W., Ippolito, D., Tramèr, F., Schmidt, L.: Are aligned neural networks adversarially aligned? In: NeurIPS (2023)
  4. Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.C.H.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
  5. Gan, Z., Chen, Y., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. In: NeurIPS (2020)
  6. Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., Duan, S., Wang, X.: FigStep: jailbreaking large vision-language models via typographic visual prompts. In: AAAI, pp. 23951–23959 (2025)
  7. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)
  8. Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: COLD-Attack: jailbreaking LLMs with stealthiness and controllability. In: ICML, PMLR vol. 235, pp. 16974–17002 (2024)
  9. He, M., Tian, X., Shen, X., Ni, J., Zou, S., Yang, Z., Zhang, J.: Few tokens matter: entropy guided attacks on vision-language models. CoRR abs/2512.21815 (2025)
  10. Huang, X., Hu, W., Zheng, T., Xiu, K., Jia, X., Wang, D., Qin, Z., Ren, K.: Untargeted jailbreak attack. CoRR abs/2510.02999 (2025)
  11. Huang, Y., Gupta, S., Xia, M., Li, K., Chen, D.: Catastrophic jailbreak of open-source LLMs via exploiting generation. In: ICLR (2024)
  12. Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M.: Llama Guard: LLM-based input-output safeguard for human-AI conversations. CoRR abs/2312.06674 (2023)
  13. Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In: NeurIPS (2023)
  14. Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML, PMLR vol. 202, pp. 19730–19742 (2023)
  15. Lin, R., Paren, A., Yuan, S., Li, M., Torr, P., Bibi, A., Liu, T.: FORCE: transferable visual jailbreaking attacks via feature over-reliance correction. CoRR abs/2509.21029 (2025)
  16. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
  17. Liu, Y., Deng, G., Li, Y., Wang, K., Zhang, T., Liu, Y., Wang, H., Zheng, Y., Liu, Y.: Prompt injection attack against LLM-integrated applications. CoRR abs/2306.05499 (2023)
  18. Luo, W., Ma, S., Liu, X., Guo, X., Xiao, C.: JailBreakV-28K: a benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. CoRR abs/2404.03027 (2024)
  19. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018)
  20. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) (duplicate of [19])
  21. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D.A., Hendrycks, D.: HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. In: ICML (2024)
  22. Mia, M.J., Amini, M.H.: JailIP: jailbreaking vision-language models via loss guided image perturbation. CoRR abs/2509.21401 (2025)
  23. Oh, S., Jin, Y., Sharma, M., Kim, D., Ma, E., Verma, G., Kumar, S.: UniGuard: towards universal safety guardrails for jailbreak attacks on multimodal large language models. CoRR abs/2411.01703 (2024)
  24. OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023)
  25. OpenAI: GPT-4o system card. CoRR abs/2410.21276 (2024)
  26. Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual adversarial examples jailbreak aligned large language models. In: AAAI, pp. 21527–21536 (2024)
  27. Rando, J., Korevaar, H., Brinkman, E., Evtimov, I., Tramèr, F.: Gradient-based jailbreak images for multimodal fusion models. CoRR abs/2410.03489 (2024)
  28. Ren, J., Dras, M., Naseem, U.: Seeing the threat: vulnerabilities in vision-language models to adversarial attack. CoRR abs/2505.21967 (2025)
  29. Schaeffer, R., Valentine, D., Bailey, L., Chua, J., Eyzaguirre, C., Durante, Z., Benton, J., Miranda, B., Sleight, H., Wang, T.T., Hughes, J., Agrawal, R., Sharma, M., Emmons, S., Koyejo, S., Perez, E.: Failures to find transferable image jailbreaks between vision-language models. In: ICLR (2025)
  30. Tian, J.: Vision-language models in teaching and learning: a systematic literature review. Education Sciences 16(1) (2026)
  31. Wang, R., Wang, X., Yao, Y., Tong, X., Ma, X.: Simulated ensemble attack: transferring jailbreaks across fine-tuned vision-language models. CoRR abs/2508.01741 (2025)
  32. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: advancing open-source multimodal models in versatility, reasoning, and efficiency
  33. Wang, Z., Yang, H., Feng, Y., Sun, P., Guo, H., Zhang, Z., Ren, K.: Towards transferable targeted adversarial examples. In: CVPR, pp. 20534–20543 (2023)
  34. Waseda, F., Nishikawa, S., Le, T., Nguyen, H.H., Echizen, I.: Closer look at the transferability of adversarial examples: how they fool different models differently. In: WACV, pp. 1360–1368 (2023)
  35. Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B.Y., Poovendran, R.: SafeDecoding: defending against jailbreak attacks via safety-aware decoding. In: ACL (2024)
  36. Xu, Z., Bai, Y., Zhang, Y., Li, Z., Xia, F., Wong, K.K., Wang, J., Zhao, H.: DriveGPT4-V2: harnessing large language model capabilities for enhanced closed-loop autonomous driving. In: CVPR, pp. 17261–17270 (2025)
  37. Yang, J., Zhang, Z., Cui, S., Wang, H., Huang, M.: Guiding not forcing: enhancing the transferability of jailbreaking attacks on LLMs via removing superfluous constraints. In: ACL (2025)
  38. Ying, Z., Liu, A., Liang, S., Huang, L., Guo, J., Zhou, W., Liu, X., Tao, D.: SafeBench: a safety evaluation framework for multimodal large language models. Int. J. Comput. Vis. 134(1), 18 (2026)
  39. Yoon, S., Jeung, W., No, A.: R-TOFU: unlearning in large reasoning models. In: EMNLP, pp. 5239–5258 (2025)
  40. Zhang, Z., Sun, Z., Zhang, Z., Guo, J., He, X.: FC-Attack: jailbreaking large vision-language models via auto-generated flowcharts. CoRR abs/2502.21059 (2025)
  41. Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H.T., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.E., Leal, I., Kuang, Y., Kalashnikov, D., et al.: RT-2: vision-language-action models transfer web knowledge to robotic control. In: Tan, J., Toussaint, M., Darvish, K. (eds.) CoRL, PMLR vol. 229 (2023)
  42. PMLR (2023), https://proceedings.mlr.press/v229/zitkovich23a.html (venue fragment of [41])
  43. Zou, A., Wang, Z., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. CoRR abs/2307.15043 (2023)