pith. machine review for the scientific record.

arxiv: 2605.10764 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization


Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords jailbreak · vision-language models · entropy maximization · transferability · untargeted attack · autoregressive decoding · multimodal safety · gradient-based attack

The pith

Maximizing entropy at high-entropy refusal tokens flips VLM outputs to harmful content without fixed targets and improves cross-model transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous gradient-based image jailbreaks on vision-language models showed almost no transfer across models, which many interpreted as evidence that transferable multimodal attacks are hard. The authors trace this to optimization objectives that force a specific prefix or response pattern. Their key observation is that refusal decisions cluster at high-entropy tokens during generation, while non-refusal tokens already hold meaningful probability mass. They therefore introduce an untargeted attack that raises entropy only at those decision points and stabilizes low-entropy positions to keep output coherent. Experiments on three VLMs and two safety benchmarks show competitive white-box success plus markedly better transfer, including against several defenses.

Core claim

Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Maximizing entropy at these decision tokens via UJEM-KL flips refusal outcomes while preserving output quality at low-entropy positions, yielding competitive white-box attack success rates and consistently higher transferability across models than prior constrained methods.

What carries the argument

UJEM-KL, an untargeted attack that maximizes entropy (via KL divergence) selectively at high-entropy decision tokens while constraining low-entropy positions.
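
The paper's objective is summarized above but not reproduced here; the following is a minimal PyTorch-style sketch of what a selective entropy objective of this shape could look like. Everything in it is an illustrative assumption rather than the authors' implementation: the threshold `tau`, the use of clean-model entropy to select decision positions, the KL-to-uniform form of entropy maximization, and all names.

```python
import torch
import torch.nn.functional as F

def ujem_kl_loss(adv_logits, clean_logits, tau=2.0, flip_mask=False):
    """Hypothetical sketch of a selective entropy-maximization objective.

    adv_logits, clean_logits: (seq_len, vocab_size) next-token logits for the
    same prompt under the adversarial and the clean image. tau is an assumed
    entropy threshold separating high-entropy "decision" positions from
    low-entropy "stable" ones; flip_mask inverts the selection for controls.
    """
    adv_logp = F.log_softmax(adv_logits, dim=-1)
    clean_logp = F.log_softmax(clean_logits, dim=-1)
    clean_p = clean_logp.exp()

    # Per-position entropy of the clean distribution picks out decision tokens.
    entropy = -(clean_p * clean_logp).sum(dim=-1)            # (seq_len,)
    decision = entropy > tau
    if flip_mask:
        decision = ~decision

    # At decision positions, pull the adversarial distribution toward uniform:
    # minimizing KL(uniform || p_adv) maximizes the entropy of p_adv.
    vocab_size = adv_logits.shape[-1]
    uniform = torch.full_like(clean_p, 1.0 / vocab_size)
    kl_to_uniform = (uniform * (uniform.log() - adv_logp)).sum(dim=-1)

    # At stable positions, anchor the adversarial distribution to the clean
    # one, which is what would keep the rest of the output coherent.
    kl_to_clean = (clean_p * (clean_logp - adv_logp)).sum(dim=-1)

    return torch.where(decision, kl_to_uniform, kl_to_clean).mean()
```

An attacker would minimize this loss with respect to the image perturbation, e.g. with PGD-style updates under a norm budget; whether the paper selects positions by clean-model or current-model entropy, and how it weights the two terms, are details this sketch deliberately leaves open.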

If this is right

  • The method achieves competitive white-box attack success rates on three VLMs across two safety benchmarks.
  • Transferability improves consistently compared with prior gradient-based universal image jailbreaks.
  • The attack remains effective under representative defenses.
  • Limited transferability in earlier work stems primarily from overly constrained optimization objectives rather than inherent model robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety mechanisms in autoregressive VLMs may be more localized to high-entropy decision points than previously modeled.
  • Defenses could be strengthened by explicitly penalizing entropy spikes at candidate refusal tokens during decoding; a sketch of such a monitor follows this list.
  • The same selective entropy approach might apply to other autoregressive multimodal or language-only models without requiring target strings.
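
To make the defense extension concrete, here is a minimal, hypothetical sketch of a decoding-time monitor: it flags a step whose next-token entropy spikes while probability mass on refusal markers collapses. The thresholds, the `refusal_ids` set, and the function name are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def refusal_entropy_guard(step_logits, refusal_ids,
                          entropy_threshold=2.0, mass_threshold=0.3):
    """Flag a suspicious decoding step (hypothetical defense sketch).

    step_logits: (vocab_size,) next-token logits at the current step.
    refusal_ids: token ids of refusal markers such as "sorry" or "cannot".
    """
    probs = F.softmax(step_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    refusal_mass = probs[refusal_ids].sum()
    # A high-entropy step whose refusal mass has been suppressed matches the
    # signature the attack is said to exploit.
    return bool(entropy > entropy_threshold) and bool(refusal_mass < mass_threshold)
```

A deployment could pause or re-sample generation whenever the guard fires; tuning the two thresholds against false positives on benign prompts would be the hard part.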

Load-bearing premise

Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack.
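
This premise is directly measurable. A hedged sketch of one way to check it, assuming a HuggingFace-style causal LM interface and a hand-picked refusal-token set (both assumptions, as is the helper name):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def refusal_entropy_profile(model, input_ids, refusal_ids):
    """Pair each generated token with the entropy of the distribution it was
    drawn from, marking refusal tokens (hypothetical measurement helper).

    input_ids: (1, seq_len) prompt plus the model's own generated response.
    refusal_ids: set of token ids for refusal markers ("sorry", "cannot", ...).
    """
    logits = model(input_ids).logits[0]          # (seq_len, vocab_size)
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # (seq_len,)

    records = []
    for pos in range(input_ids.shape[1] - 1):
        emitted = input_ids[0, pos + 1].item()   # token produced at this step
        records.append((pos, entropy[pos].item(), emitted in refusal_ids))
    return records
```

If the premise holds, the positions marked as refusal tokens should cluster in the upper tail of the entropy values; if they do not, the attack's selective targeting loses its rationale.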

What would settle it

If an attack that instead maximizes entropy only at low-entropy tokens produces equal or higher transferability, or if UJEM-KL shows no transfer improvement over prior constrained objectives on the same models.
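
Using the hypothetical `ujem_kl_loss` sketch above, the first of these controls amounts to nothing more than inverting the position mask; comparable transfer from this variant would undercut the premise.

```python
# Control experiment: raise entropy only at LOW-entropy positions instead,
# by inverting the mask in the ujem_kl_loss sketch above (flip_mask is an
# illustrative flag from that sketch, not a switch described in the paper).
control_loss = ujem_kl_loss(adv_logits, clean_logits, tau=2.0, flip_mask=True)
```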

Figures

Figures reproduced from arXiv: 2605.10764 by Jing Zhang, Jinhong Ni, Mengqi He, Shu Zou, Weikang Li, Xin Shen, Xinyu Tian, Xuesong Li, Zhaoyuan Yang.

Figure 1: Comparison between our untargeted multimodal jailbreak (right)…

Figure 2: Attack success rate (ASR) under different objective-relaxed settings. As discussed above, overly constrained optimization objectives may be a key reason for the low transferability of VLM jailbreak attacks. We therefore consider a simple optimization-free baseline that weakens safety alignment by manipulating decoding configurations, e.g., increasing the temperature [11], and show the attack success rat…

Figure 3: Top-10 high-entropy token shifts after perturbation. Observation 1: non-refusal tokens exist inherently. Safety-aligned LLMs are trained to refuse unsafe requests. When encountering harmful prompts, models often start responses with characteristic refusal phrases such as "I'm sorry". A non-refusal token is any token that does not belong to the refusal token set, indicating the model did not trigger i…

Figure 4: Refusal mass at different tokens. Observation 2: refusal concentrates at high-entropy decision points. Across multiple VLMs, refusal-indicative tokens (e.g., "sorry", "cannot") tend to appear at high-entropy positions…

Figure 5: Judge sensitivity under untargeted jailbreak evaluation.

Figure 6: Case study. Top: for the same clean image and the unsafe instruction, both Qwen-VL and LLaVA refuse on clean inputs. We then compare adversarial images crafted by a prior optimization-based baseline (SEA [31]) and our method (UJEM-KL). On each model, both attacks trigger unsafe response behavior. Bottom: cross-model transfer from Qwen→LLaVA using adversarial images optimized on Qwen-VL. SEA [31] fails to c…

Appendix Figure 1: Additional qualitative cases for Observation 3 and judge disagree…
Original abstract

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that limited cross-model transferability in prior gradient-based universal image jailbreaks on VLMs stems primarily from overly constrained optimization objectives (e.g., fixed prefixes or response patterns). A preliminary experiment is cited showing that refusal concentrates at high-entropy tokens during autoregressive decoding while non-refusal tokens already hold substantial top-ranked probability mass; this motivates the proposed UJEM-KL attack, which maximizes entropy at selected decision tokens and stabilizes low-entropy positions. Across three VLMs and two benchmarks the method reportedly achieves competitive white-box success rates, improved transferability, and remains effective under defenses.

Significance. If the entropy-concentration observation is robust and the transfer gains are reproducible with proper controls, the work would meaningfully advance VLM adversarial robustness by showing that relaxing objective constraints via entropy maximization can yield better transfer without target patterns. The lightweight nature of the attack and its reported resilience to defenses would be useful for safety evaluation, but the current lack of quantitative baselines, ablations, and statistical details limits the strength of the contribution.

major comments (2)
  1. [Abstract] The attribution of poor transferability to "overly constrained optimization objectives" is load-bearing for the paper's motivation and conclusion, yet it rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, probability-mass comparisons) are not reported, leaving the key assumption unverified.
  2. [Abstract / Experimental results] The claims of "competitive white-box attack success rates" and "consistently improves transferability" are presented without numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.
minor comments (1)
  1. [Abstract] The acronym UJEM-KL is introduced in the abstract without expansion or a forward reference to its definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The attribution of poor transferability to "overly constrained optimization objectives" is load-bearing for the paper's motivation and conclusion, yet it rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, probability-mass comparisons) are not reported, leaving the key assumption unverified.

    Authors: We agree that the abstract should make the preliminary experiment verifiable. The full manuscript (Section 3.1) reports the experiment using 100 prompts sampled from the two safety benchmarks across the three VLMs, with quantitative entropy distributions and top-k probability mass comparisons provided in Figure 2 and the associated text. These show refusal concentrating at higher-entropy positions while non-refusal tokens already occupy substantial mass in the top candidates. We will revise the abstract to concisely include the prompt count, models, and key quantitative observations, with a pointer to Section 3.1 and Figure 2. revision: yes

  2. Referee: [Abstract / Experimental results] The claims of "competitive white-box attack success rates" and "consistently improves transferability" are presented without numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.

    Authors: The abstract is intentionally high-level, but the full manuscript (Section 4 and Tables 1-4) provides the requested quantitative details: white-box ASR values competitive with prior methods, transferability gains across models, direct baseline comparisons (including fixed-target gradient attacks), ablation studies isolating the entropy-maximization term, and statistical significance via repeated runs with t-tests. We will update the abstract to report representative numerical results and explicitly reference the ablations and controls so readers can immediately evaluate the contribution of the entropy-maximization component. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Rationale

The paper motivates UJEM-KL from a preliminary empirical observation that refusal concentrates at high-entropy tokens and then demonstrates improved transferability via experiments on three VLMs. No equation or result is shown to reduce by construction to a fitted parameter, self-citation, or renamed input; the attribution of limited transferability to constrained objectives follows from comparative attack success rates rather than tautological redefinition. This is a standard empirical workflow with no load-bearing circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on an empirical observation about entropy concentration in autoregressive decoding; no explicit free parameters, mathematical axioms, or new postulated entities are introduced in the provided abstract.

pith-pipeline@v0.9.0 · 5500 in / 946 out tokens · 38549 ms · 2026-05-12T03:32:55.001710+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 10 internal anchors

  1. Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a visual language model for few-shot learning. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) NeurIPS (2022)
  2. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report. CoRR abs/2502.13923 (2025)
  3. Carlini, N., Nasr, M., Choquette-Choo, C.A., Jagielski, M., Gao, I., Koh, P.W., Ippolito, D., Tramèr, F., Schmidt, L.: Are aligned neural networks adversarially aligned? In: NeurIPS (2023)
  4. Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.C.H.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
  5. Gan, Z., Chen, Y., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. In: NeurIPS (2020)
  6. Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., Duan, S., Wang, X.: FigStep: jailbreaking large vision-language models via typographic visual prompts. In: AAAI, pp. 23951–23959 (2025)
  7. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)
  8. Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: COLD-Attack: jailbreaking LLMs with stealthiness and controllability. In: ICML, PMLR vol. 235, pp. 16974–17002 (2024)
  9. He, M., Tian, X., Shen, X., Ni, J., Zou, S., Yang, Z., Zhang, J.: Few tokens matter: entropy guided attacks on vision-language models. CoRR abs/2512.21815 (2025)
  10. Huang, X., Hu, W., Zheng, T., Xiu, K., Jia, X., Wang, D., Qin, Z., Ren, K.: Untargeted jailbreak attack. CoRR abs/2510.02999 (2025)
  11. Huang, Y., Gupta, S., Xia, M., Li, K., Chen, D.: Catastrophic jailbreak of open-source LLMs via exploiting generation. In: ICLR (2024)
  12. Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M.: Llama Guard: LLM-based input-output safeguard for human-AI conversations. CoRR abs/2312.06674 (2023)
  13. Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In: NeurIPS (2023)
  14. Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML, PMLR vol. 202, pp. 19730–19742 (2023)
  15. Lin, R., Paren, A., Yuan, S., Li, M., Torr, P., Bibi, A., Liu, T.: FORCE: transferable visual jailbreaking attacks via feature over-reliance correction. CoRR abs/2509.21029 (2025)
  16. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
  17. Liu, Y., Deng, G., Li, Y., Wang, K., Zhang, T., Liu, Y., Wang, H., Zheng, Y., Liu, Y.: Prompt injection attack against LLM-integrated applications. CoRR abs/2306.05499 (2023)
  18. Luo, W., Ma, S., Liu, X., Guo, X., Xiao, C.: JailBreakV-28K: a benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. CoRR abs/2404.03027 (2024)
  19. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018)
  20. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) (duplicate of [19])
  21. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D.A., Hendrycks, D.: HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. In: ICML (2024)
  22. Mia, M.J., Amini, M.H.: JailIP: jailbreaking vision-language models via loss guided image perturbation. CoRR abs/2509.21401 (2025)
  23. Oh, S., Jin, Y., Sharma, M., Kim, D., Ma, E., Verma, G., Kumar, S.: UniGuard: towards universal safety guardrails for jailbreak attacks on multimodal large language models. CoRR abs/2411.01703 (2024)
  24. OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023)
  25. OpenAI: GPT-4o system card. CoRR abs/2410.21276 (2024)
  26. Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual adversarial examples jailbreak aligned large language models. In: AAAI, pp. 21527–21536 (2024)
  27. Rando, J., Korevaar, H., Brinkman, E., Evtimov, I., Tramèr, F.: Gradient-based jailbreak images for multimodal fusion models. CoRR abs/2410.03489 (2024)
  28. Ren, J., Dras, M., Naseem, U.: Seeing the threat: vulnerabilities in vision-language models to adversarial attack. CoRR abs/2505.21967 (2025)
  29. Schaeffer, R., Valentine, D., Bailey, L., Chua, J., Eyzaguirre, C., Durante, Z., Benton, J., Miranda, B., Sleight, H., Wang, T.T., Hughes, J., Agrawal, R., Sharma, M., Emmons, S., Koyejo, S., Perez, E.: Failures to find transferable image jailbreaks between vision-language models. In: ICLR (2025)
  30. Tian, J.: Vision-language models in teaching and learning: a systematic literature review. Education Sciences 16(1) (2026)
  31. Wang, R., Wang, X., Yao, Y., Tong, X., Ma, X.: Simulated ensemble attack: transferring jailbreaks across fine-tuned vision-language models. CoRR abs/2508.01741 (2025)
  32. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: advancing open-source multimodal models in versatility, reasoning, and efficiency
  33. Wang, Z., Yang, H., Feng, Y., Sun, P., Guo, H., Zhang, Z., Ren, K.: Towards transferable targeted adversarial examples. In: CVPR, pp. 20534–20543 (2023)
  34. Waseda, F., Nishikawa, S., Le, T., Nguyen, H.H., Echizen, I.: Closer look at the transferability of adversarial examples: how they fool different models differently. In: WACV, pp. 1360–1368 (2023)
  35. Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B.Y., Poovendran, R.: SafeDecoding: defending against jailbreak attacks via safety-aware decoding. In: ACL (2024)
  36. Xu, Z., Bai, Y., Zhang, Y., Li, Z., Xia, F., Wong, K.K., Wang, J., Zhao, H.: DriveGPT4-V2: harnessing large language model capabilities for enhanced closed-loop autonomous driving. In: CVPR, pp. 17261–17270 (2025)
  37. Yang, J., Zhang, Z., Cui, S., Wang, H., Huang, M.: Guiding not forcing: enhancing the transferability of jailbreaking attacks on LLMs via removing superfluous constraints. In: ACL (2025)
  38. Ying, Z., Liu, A., Liang, S., Huang, L., Guo, J., Zhou, W., Liu, X., Tao, D.: SafeBench: a safety evaluation framework for multimodal large language models. Int. J. Comput. Vis. 134(1), 18 (2026)
  39. Yoon, S., Jeung, W., No, A.: R-TOFU: unlearning in large reasoning models. In: EMNLP, pp. 5239–5258 (2025)
  40. Zhang, Z., Sun, Z., Zhang, Z., Guo, J., He, X.: FC-Attack: jailbreaking large vision-language models via auto-generated flowcharts. CoRR abs/2502.21059 (2025)
  41. Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H.T., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.E., Leal, I., Kuang, Y., Kalashnikov, D., et al.: RT-2: vision-language-action models transfer web knowledge to robotic control. In: Tan, J., Toussaint, M., Darvish, K. (eds.) CoRL, PMLR vol. 229 (2023)
  42. PMLR (2023), https://proceedings.mlr.press/v229/zitkovich23a.html (venue fragment of [41])
  43. Zou, A., Wang, Z., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. CoRR abs/2307.15043 (2023)