RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion

Ruofan Wang; Xingjun Ma

arxiv: 2503.06223 · v5 · submitted 2025-03-08 · 💻 cs.CV


Pith reviewed 2026-05-23 00:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal safety, vision-language models, diffusion models, safety auditing, reinforcement optimization, contextual failures, VLM vulnerabilities, black-box testing


The pith

VLMs produce more unsafe responses when partial toxic text is paired with certain generated visual contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that safety alignments in vision-language models break down under harmful visual contexts that accompany text inputs, a risk missed by evaluations that focus only on explicit malicious instructions. This matters because deployed VLMs receive multimodal inputs where images can steer outputs toward unsafe behavior. RedDiffuser addresses the gap by using a reinforcement-based method with diffusion models to systematically generate coherent visuals that trigger these failures in black-box settings. Experiments across open and commercial models show the failures are widespread and remain even when external guardrails are applied.

Core claim

Context-conditioned safety failures are widespread in VLMs. RedDiffuser combines greedy prompt search with reinforcement optimization to generate semantically coherent visual inputs via diffusion models, and the resulting high-risk multimodal inputs raise LLaVA's unsafe response rate by up to 10.69% on the original set and 8.91% on a hold-out set. The failures transfer to Gemini and LLaMA-Vision and persist even under external guardrails.

What carries the argument

RedDiffuser, a reinforcement-based framework that leverages diffusion models to generate semantically coherent visual inputs for black-box safety testing under harmful contextual exposure.
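The greedy prompt search named in the core claim can be pictured as a keep-the-best loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `propose` and `score` are hypothetical callables (in the paper's setting a language model such as Gemini would play the proposer role, and the score would come from the audited model's responses), and the toy helpers use a harmless string-length objective purely to make the loop runnable.

```python
from typing import Callable, List, Tuple


def greedy_prompt_search(
    seed: str,
    propose: Callable[[str], List[str]],  # hypothetical: returns candidate prompts
    score: Callable[[str], float],        # hypothetical: higher means "better"
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep the best-scoring prompt seen so far, refining it for a few rounds."""
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        improved = False
        # Candidates are proposed once per round from the current best prompt.
        for candidate in propose(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break  # no candidate beat the incumbent, so stop early
    return best, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_propose(prompt: str) -> List[str]:
    return [prompt + " a", prompt + " bb"]


def toy_score(prompt: str) -> float:
    # Prefer prompts whose length is close to 10 characters.
    return -abs(len(prompt) - 10)
```

Greedy search only ever accepts improvements, so it is cheap in black-box queries but can stall in a local optimum; that trade-off is presumably why the paper pairs it with a separate reinforcement stage.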

If this is right

Current system-level safety mechanisms remain insufficient for realistic multimodal risks.
Vulnerabilities transfer across models, from LLaVA to Gemini and LLaMA-Vision.
Text-only auditing is insufficient because visual context can substantially steer model behavior.
Context-aware multimodal auditing is required to diagnose hidden vulnerabilities in VLM systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to audit safety failures involving other partial-harm signals, such as audio or video clips paired with text.
Safety training for VLMs may need to include optimization over visual contexts rather than text alone.
Benchmarks that mix partial toxicity across modalities could become standard for evaluating deployed systems.

Load-bearing premise

The generated visual inputs are semantically coherent and the observed increases in unsafe responses are caused by the visual context rather than artifacts from the diffusion or optimization process itself.

What would settle it

Running the same LLaVA experiments with random or non-optimized images from the same diffusion model and finding no comparable rise in unsafe response rates would indicate the effect depends on the specific visuals selected by RedDiffuser.
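One way to make this comparison concrete is a two-proportion z-test on unsafe-response counts from the optimized and the random-image conditions. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the function names are hypothetical, and the test treats the two conditions as independent samples.

```python
import math
from typing import Sequence, Tuple


def unsafe_rate(flags: Sequence[bool]) -> float:
    """Fraction of responses flagged unsafe."""
    if not flags:
        raise ValueError("need at least one response")
    return sum(flags) / len(flags)


def two_proportion_z_test(
    unsafe_a: int, n_a: int, unsafe_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both conditions share one unsafe rate."""
    p_a, p_b = unsafe_a / n_a, unsafe_b / n_b
    pooled = (unsafe_a + unsafe_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        # Both conditions are all-safe or all-unsafe: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

If the same partial-toxic prompts are reused across both conditions, the samples are paired and a paired test would be the more appropriate choice; the unpaired version here is only the simplest starting point.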

Figures

Figures reproduced from arXiv: 2503.06223 by Ruofan Wang, Xingjun Ma.

**Figure 2.** RedDiffuser overview. Given an incomplete toxic sentence, Gemini selects an image prompt via greedy search. A diffusion model generates an image, […]

**Figure 3.** Comparison of images generated by the general-purpose Stable […]

Original abstract

Large Vision-Language Models (VLMs) are increasingly deployed in open-ended environments, where ensuring reliable safety under multimodal inputs is critical. However, existing evaluations remain largely instruction-centric, focusing on explicit malicious queries while overlooking a more realistic and underexplored risk: whether safety alignment remains robust under harmful contextual exposure. This limitation is particularly important for multimodal systems, where visual inputs can substantially steer model behavior and render text-only auditing insufficient. In this work, we study multimodal safety auditing under harmful contextual exposure, asking whether VLMs can maintain safe behavior when partial toxic text is paired with visual context. To enable systematic auditing, we propose RedDiffuser (RedDiff), a reinforcement-based framework that leverages diffusion models to generate semantically coherent visual inputs for black-box safety testing. By combining greedy prompt search with reinforcement optimization, RedDiffuser uncovers high-risk multimodal inputs that expose latent safety failures. Extensive experiments on both open-source and commercial VLMs show that such context-conditioned failures are widespread. On LLaVA, RedDiffuser increases unsafe response rates by up to 10.69% on the original set and 8.91% on a hold-out set, with strong transferability to Gemini and LLaMA-Vision. These vulnerabilities persist even under external safety guardrails, suggesting that current system-level safety mechanisms remain insufficient for realistic multimodal risks. Our findings reveal a critical blind spot in existing safety evaluations and establish context-aware multimodal auditing as an essential paradigm for diagnosing hidden vulnerabilities in modern VLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RedDiffuser, a reinforcement-based auditing framework that combines greedy prompt search with diffusion model optimization to generate semantically coherent visual inputs paired with partial toxic text. It claims this exposes widespread context-conditioned safety failures in VLMs, reporting unsafe response rate increases of up to 10.69% on LLaVA (original set) and 8.91% (hold-out set), with transferability to Gemini and LLaMA-Vision, and persistence under external guardrails.

Significance. If the results hold after verification, the work would establish context-aware multimodal auditing as a necessary complement to text-only evaluations, revealing a blind spot in VLM safety alignments. The reinforced diffusion approach offers a systematic, black-box method for generating high-risk test cases that could inform more robust alignment techniques.

major comments (2)

Abstract and experimental results section: The central claim attributes the reported unsafe response increases (up to 10.69% on LLaVA) to harmful visual context, but provides no quantitative validation such as CLIP similarity to toxic prompts, human coherence ratings, or ablations against non-optimized diffusion samples. This is load-bearing for interpreting the percentages as evidence of multimodal safety failures rather than diffusion/optimization artifacts.
Methods and experimental results: The abstract reports specific percentage increases, transferability, and persistence under guardrails with no details on unsafe response measurement criteria, baseline comparisons, statistical significance testing, or controls for confounding factors in the optimization process. These omissions directly affect the soundness of the quantitative claims.
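The CLIP similarity check requested in the first comment could take the form of CLIPScore (Hessel et al.), which rescales the cosine similarity between an image embedding and a text embedding as w · max(cos, 0) with w = 2.5. The sketch below assumes the embeddings have already been produced by some CLIP encoder and are passed in as plain float sequences; the function names are illustrative, and nothing here reflects how the authors computed their scores.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score(
    image_emb: Sequence[float], text_emb: Sequence[float], w: float = 2.5
) -> float:
    """CLIPScore-style value: w * max(cos(image, text), 0)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)
```

A score near zero for optimized images would suggest the unsafe-rate gains come from something other than semantic alignment with the text, which is exactly the artifact concern the referee raises.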

minor comments (1)

The abstract would be clearer with explicit mention of the total number of VLMs evaluated and the size of the hold-out set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation and experimental transparency. We address each major comment below and have revised the manuscript to incorporate the suggested additions.


Referee (Abstract and experimental results section): The central claim attributes the reported unsafe response increases (up to 10.69% on LLaVA) to harmful visual context, but provides no quantitative validation such as CLIP similarity to toxic prompts, human coherence ratings, or ablations against non-optimized diffusion samples. This is load-bearing for interpreting the percentages as evidence of multimodal safety failures rather than diffusion/optimization artifacts.

Authors: We agree that quantitative validation is essential to attribute the unsafe response increases specifically to harmful visual context. In the revised manuscript we have added CLIP similarity scores between generated images and toxic prompts, human coherence and relevance ratings on sampled outputs (with reported inter-rater agreement), and ablation comparisons against non-optimized diffusion samples. These results are now presented in the experimental results section and support the interpretation that the observed increases reflect context-conditioned safety failures.

revision: yes

Referee (Methods and experimental results): The abstract reports specific percentage increases, transferability, and persistence under guardrails with no details on unsafe response measurement criteria, baseline comparisons, statistical significance testing, or controls for confounding factors in the optimization process. These omissions directly affect the soundness of the quantitative claims.

Authors: We acknowledge these omissions affect interpretability. The revised manuscript now details the unsafe response measurement protocol (hybrid keyword and LLM-judge approach with agreement statistics), includes baseline comparisons to text-only and non-reinforced diffusion conditions, reports statistical significance testing (paired tests with p-values), and describes controls for optimization confounders such as iteration count and prompt length. These additions appear in the Methods and Experimental Results sections.

revision: yes
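The simulated rebuttal mentions agreement statistics between a keyword detector and an LLM judge without naming a measure. Cohen's kappa is one common choice for two raters assigning categorical labels; the sketch below assumes that choice and uses made-up labels, so it illustrates the statistic rather than anything reported in the paper.

```python
from typing import Sequence


def cohens_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labelled independently at their own base rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, with 1 meaning "unsafe", a keyword detector output of `[1, 1, 0, 0, 1, 0, 1, 0]` and a judge output of `[1, 1, 0, 0, 0, 0, 1, 1]` agree on 6 of 8 items, which gives a kappa of 0.5 once the 0.5 chance agreement is discounted.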

Circularity Check

0 steps flagged

No circularity; empirical auditing framework with independent experimental results


The paper proposes RedDiffuser as a reinforcement-based diffusion framework for generating visual contexts to audit VLMs and reports measured increases in unsafe response rates from experiments on models like LLaVA. No equations, derivations, or self-referential definitions appear in the provided text that would reduce any claimed outcome to a fitted parameter or input by construction. The central results are empirical measurements on external VLMs, which remain falsifiable and independent of the method's internal definition. No self-citation load-bearing steps or uniqueness theorems are invoked. This is a standard empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted; the method likely depends on unstated hyperparameters in the reinforcement learning loop and diffusion sampling, but these are not detailed.

pith-pipeline@v0.9.0 · 5802 in / 1313 out tokens · 48647 ms · 2026-05-23T00:08:42.053717+00:00 · methodology



[1] [1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauthet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking mul- timodal large language models,

Y . Li, H. Guo, K. Zhou, W. X. Zhao, and J.-R. Wen, “Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking mul- timodal large language models,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 174–189

work page 2024

[4] [4]

Visual adversarial examples jailbreak aligned large language models,

X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, “Visual adversarial examples jailbreak aligned large language models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 19, 2024, pp. 21 527–21 536

work page 2024

[5] [5]

White- box multimodal jailbreaks against large vision-language models,

R. Wang, X. Ma, H. Zhou, C. Ji, G. Ye, and Y .-G. Jiang, “White- box multimodal jailbreaks against large vision-language models,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6920–6928

work page 2024

[6] [6]

Diffusion models for adversarial purifi- cation.arXiv preprint arXiv:2205.07460,

W. Nie, B. Guo, Y . Huang, C. Xiao, A. Vahdat, and A. Anandku- mar, “Diffusion models for adversarial purification,”arXiv preprint arXiv:2205.07460, 2022

work page arXiv 2022

[7] [7]

A mutation-based method for multi-modal jailbreaking attack detection,

X. Zhang, C. Zhang, T. Li, Y . Huang, X. Jia, X. Xie, Y . Liu, and C. Shen, “A mutation-based method for multi-modal jailbreaking attack detection,”arXiv preprint arXiv:2312.10766, 2023

work page arXiv 2023

[8] [8]

Failures to find transferable image jailbreaks between vision-language models,

R. Schaeffer, D. Valentine, L. Bailey, J. Chua, Z. Durante, C. Eyzaguirre, J. Benton, B. Miranda, H. Sleight, T. T. Wanget al., “Failures to find transferable image jailbreaks between vision-language models,” in Workshop on Socially Responsible Language Modelling Research

work page

[9] [9]

Fig- 8 Step: Jailbreaking Large Vision-language Models via Typo- graphic Visual Prompts

Y . Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,”arXiv preprint arXiv:2311.05608, 2023

work page arXiv 2023

[10] [10]

Query-relevant images jailbreak large multi-modal models,

X. Liu, Y . Zhu, Y . Lan, C. Yang, and Y . Qiao, “Query-relevant images jailbreak large multi-modal models,”arXiv preprint arXiv:2311.17600, 2023

work page arXiv 2023

[11] [11]

Ideator: Jailbreaking and benchmarking large vision- language models using themselves,

R. Wang, J. Li, Y . Wang, B. Wang, X. Wang, Y . Teng, Y . Wang, X. Ma, and Y .-G. Jiang, “Ideator: Jailbreaking and benchmarking large vision- language models using themselves,”arXiv preprint arXiv:2411.00827, 2024

work page arXiv 2024

[12] [12]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

work page 2022

[13] [13]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024

work page 2024

[14] [14]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021

[15] [15]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[18] [18]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalezet al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,”See https://vicuna. lmsys. org (accessed 14 April 2023), 2023

work page 2023

[19] [19]

Instructblip: Towards general-purpose vision- language models with instruction tuning,

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision- language models with instruction tuning,” 2023

work page 2023

[20] [20]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,”arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

E. Shayegani, M. A. A. Mamun, Y . Fu, P. Zaree, Y . Dong, and N. Abu- Ghazaleh, “Survey of vulnerabilities in large language models revealed by adversarial attacks,”arXiv preprint arXiv:2310.10844, 2023

work page arXiv 2023

[22] [22]

A survey of attacks on large vision- language models: Resources, advances, and future trends

D. Liu, M. Yang, X. Qu, P. Zhou, W. Hu, and Y . Cheng, “A survey of attacks on large vision-language models: Resources, advances, and future trends,”arXiv preprint arXiv:2407.07403, 2024

work page arXiv 2024

[23] [23]

Privacy in large language models: Attacks, defenses and future directions,

H. Li, Y . Chen, J. Luo, J. Wang, H. Peng, Y . Kang, X. Zhang, Q. Hu, C. Chan, Z. Xuet al., “Privacy in large language models: Attacks, defenses and future directions,”arXiv preprint arXiv:2310.10383, 2023

work page arXiv 2023

[24] [24]

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

X. Ma, Y . Gao, Y . Wang, R. Wang, X. Wang, Y . Sun, Y . Ding, H. Xu, Y . Chen, Y . Zhaoet al., “Safety at scale: A comprehensive survey of large model safety,”arXiv preprint arXiv:2502.05206, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

(ab) using images and sounds for indirect instruction injection in multi-modal llms,

E. Bagdasaryan, T.-Y . Hsieh, B. Nassi, and V . Shmatikov, “(ab) using images and sounds for indirect instruction injection in multi-modal llms,” arXiv preprint arXiv:2307.10490, 2023

[26]

Image hijacks: Adversarial images can control generative models at runtime

L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adversarial images can control generative models at runtime,” arXiv preprint arXiv:2309.00236, 2023

[27]

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

E. Shayegani, Y. Dong, and N. Abu-Ghazaleh, “Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,” in The Twelfth International Conference on Learning Representations, 2023

[28]

Are aligned neural networks adversarially aligned?

N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, P. W. W. Koh, D. Ippolito, F. Tramer, and L. Schmidt, “Are aligned neural networks adversarially aligned?” Advances in Neural Information Processing Systems, vol. 36, 2024

[29]

Jailbreaking attack against multimodal large language model

Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, “Jailbreaking attack against multimodal large language model,” arXiv preprint arXiv:2402.02309, 2024

[30]

Visual-roleplay: Universal jailbreak attack on multimodal large language models via role-playing image character

S. Ma, W. Luo, Y. Wang, X. Liu, M. Chen, B. Li, and C. Xiao, “Visual-roleplay: Universal jailbreak attack on multimodal large language models via role-playing image character,” arXiv preprint arXiv:2405.20773, 2024

[31]

Training Diffusion Models with Reinforcement Learning

K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” arXiv preprint arXiv:2305.13301, 2023

[32]

Approximately optimal approximate reinforcement learning

S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274

[33]

Language Models are Few-Shot Learners

T. B. Brown, “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020

[34]

Detoxify

L. Hanu and Unitary team, “Detoxify,” GitHub. https://github.com/unitaryai/detoxify, 2020

[35]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019

[36]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” arXiv preprint arXiv:2009.11462, 2020

[37]

Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp

T. Schick, S. Udupa, and H. Schütze, “Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1408–1424, 2021

[38]

Robust conversational agents against imperceptible toxicity triggers

N. Mehrabi, A. Beirami, F. Morstatter, and A. Galstyan, “Robust conversational agents against imperceptible toxicity triggers,” arXiv preprint arXiv:2205.02392, 2022

[39]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “Clipscore: A reference-free evaluation metric for image captioning,” arXiv preprint arXiv:2104.08718, 2021

[40]

Scaling laws for reward model overoptimization

L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” in International Conference on Machine Learning. PMLR, 2023, pp. 10835–10866
