RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
Pith reviewed 2026-05-23 00:08 UTC · model grok-4.3
The pith
VLMs produce more unsafe responses when partial toxic text is paired with certain generated visual contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Context-conditioned safety failures are widespread in VLMs; RedDiffuser, which combines greedy prompt search with reinforcement optimization to generate semantically coherent visual inputs via diffusion models, uncovers high-risk multimodal inputs that increase unsafe response rates by up to 10.69 percent on LLaVA and 8.91 percent on a hold-out set, with transferability to Gemini and LLaMA-Vision even under guardrails.
What carries the argument
RedDiffuser, a reinforcement-based framework that leverages diffusion models to generate semantically coherent visual inputs for black-box safety testing under harmful contextual exposure.
If this is right
- Current system-level safety mechanisms remain insufficient for realistic multimodal risks.
- Vulnerabilities transfer across models, from LLaVA to Gemini and LLaMA-Vision.
- Text-only auditing is insufficient because visual context can substantially steer model behavior.
- Context-aware multimodal auditing is required to diagnose hidden vulnerabilities in VLM systems.
Where Pith is reading between the lines
- The approach could be extended to audit safety failures involving other partial-harm signals, such as audio or video clips paired with text.
- Safety training for VLMs may need to include optimization over visual contexts rather than text alone.
- Benchmarks that mix partial toxicity across modalities could become standard for evaluating deployed systems.
Load-bearing premise
The generated visual inputs are semantically coherent and the observed increases in unsafe responses are caused by the visual context rather than artifacts from the diffusion or optimization process itself.
What would settle it
Running the same LLaVA experiments with random or non-optimized images from the same diffusion model and finding no comparable rise in unsafe response rates would indicate the effect depends on the specific visuals selected by RedDiffuser.
Figures
read the original abstract
Large Vision-Language Models (VLMs) are increasingly deployed in open-ended environments, where ensuring reliable safety under multimodal inputs is critical. However, existing evaluations remain largely instruction-centric, focusing on explicit malicious queries while overlooking a more realistic and underexplored risk: whether safety alignment remains robust under harmful contextual exposure. This limitation is particularly important for multimodal systems, where visual inputs can substantially steer model behavior and render text-only auditing insufficient. In this work, we study multimodal safety auditing under harmful contextual exposure, asking whether VLMs can maintain safe behavior when partial toxic text is paired with visual context. To enable systematic auditing, we propose RedDiffuser (RedDiff), a reinforcement-based framework that leverages diffusion models to generate semantically coherent visual inputs for black-box safety testing. By combining greedy prompt search with reinforcement optimization, RedDiffuser uncovers high-risk multimodal inputs that expose latent safety failures. Extensive experiments on both open-source and commercial VLMs show that such context-conditioned failures are widespread. On LLaVA, RedDiffuser increases unsafe response rates by up to 10.69% on the original set and 8.91% on a hold-out set, with strong transferability to Gemini and LLaMA-Vision. These vulnerabilities persist even under external safety guardrails, suggesting that current system-level safety mechanisms remain insufficient for realistic multimodal risks. Our findings reveal a critical blind spot in existing safety evaluations and establish context-aware multimodal auditing as an essential paradigm for diagnosing hidden vulnerabilities in modern VLM systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RedDiffuser, a reinforcement-based auditing framework that combines greedy prompt search with diffusion model optimization to generate semantically coherent visual inputs paired with partial toxic text. It claims this exposes widespread context-conditioned safety failures in VLMs, reporting unsafe response rate increases of up to 10.69% on LLaVA (original set) and 8.91% (hold-out set), with transferability to Gemini and LLaMA-Vision, and persistence under external guardrails.
Significance. If the results hold after verification, the work would establish context-aware multimodal auditing as a necessary complement to text-only evaluations, revealing a blind spot in VLM safety alignments. The reinforced diffusion approach offers a systematic, black-box method for generating high-risk test cases that could inform more robust alignment techniques.
major comments (2)
- [Abstract and experimental results section] Abstract and experimental results section: The central claim attributes the reported unsafe response increases (up to 10.69% on LLaVA) to harmful visual context, but provides no quantitative validation such as CLIP similarity to toxic prompts, human coherence ratings, or ablations against non-optimized diffusion samples. This is load-bearing for interpreting the percentages as evidence of multimodal safety failures rather than diffusion/optimization artifacts.
- [Methods and experimental results] Methods and experimental results: The abstract reports specific percentage increases, transferability, and persistence under guardrails with no details on unsafe response measurement criteria, baseline comparisons, statistical significance testing, or controls for confounding factors in the optimization process. These omissions directly affect the soundness of the quantitative claims.
minor comments (1)
- The abstract would be clearer with explicit mention of the total number of VLMs evaluated and the size of the hold-out set.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation and experimental transparency. We address each major comment below and have revised the manuscript to incorporate the suggested additions.
read point-by-point responses
-
Referee: [Abstract and experimental results section] Abstract and experimental results section: The central claim attributes the reported unsafe response increases (up to 10.69% on LLaVA) to harmful visual context, but provides no quantitative validation such as CLIP similarity to toxic prompts, human coherence ratings, or ablations against non-optimized diffusion samples. This is load-bearing for interpreting the percentages as evidence of multimodal safety failures rather than diffusion/optimization artifacts.
Authors: We agree that quantitative validation is essential to attribute the unsafe response increases specifically to harmful visual context. In the revised manuscript we have added CLIP similarity scores between generated images and toxic prompts, human coherence and relevance ratings on sampled outputs (with reported inter-rater agreement), and ablation comparisons against non-optimized diffusion samples. These results are now presented in the experimental results section and support the interpretation that the observed increases reflect context-conditioned safety failures. revision: yes
-
Referee: [Methods and experimental results] Methods and experimental results: The abstract reports specific percentage increases, transferability, and persistence under guardrails with no details on unsafe response measurement criteria, baseline comparisons, statistical significance testing, or controls for confounding factors in the optimization process. These omissions directly affect the soundness of the quantitative claims.
Authors: We acknowledge these omissions affect interpretability. The revised manuscript now details the unsafe response measurement protocol (hybrid keyword and LLM-judge approach with agreement statistics), includes baseline comparisons to text-only and non-reinforced diffusion conditions, reports statistical significance testing (paired tests with p-values), and describes controls for optimization confounders such as iteration count and prompt length. These additions appear in the Methods and Experimental Results sections. revision: yes
Circularity Check
No circularity; empirical auditing framework with independent experimental results
full rationale
The paper proposes RedDiffuser as a reinforcement-based diffusion framework for generating visual contexts to audit VLMs and reports measured increases in unsafe response rates from experiments on models like LLaVA. No equations, derivations, or self-referential definitions appear in the provided text that would reduce any claimed outcome to a fitted parameter or input by construction. The central results are empirical measurements on external VLMs, which remain falsifiable and independent of the method's internal definition. No self-citation load-bearing steps or uniqueness theorems are invoked. This is a standard empirical contribution.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Gemini: A Family of Highly Capable Multimodal Models
G. Team, R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauthet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Y . Li, H. Guo, K. Zhou, W. X. Zhao, and J.-R. Wen, “Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking mul- timodal large language models,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 174–189
work page 2024
-
[4]
Visual adversarial examples jailbreak aligned large language models,
X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, “Visual adversarial examples jailbreak aligned large language models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 19, 2024, pp. 21 527–21 536
work page 2024
-
[5]
White- box multimodal jailbreaks against large vision-language models,
R. Wang, X. Ma, H. Zhou, C. Ji, G. Ye, and Y .-G. Jiang, “White- box multimodal jailbreaks against large vision-language models,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6920–6928
work page 2024
-
[6]
Diffusion models for adversarial purifi- cation.arXiv preprint arXiv:2205.07460,
W. Nie, B. Guo, Y . Huang, C. Xiao, A. Vahdat, and A. Anandku- mar, “Diffusion models for adversarial purification,”arXiv preprint arXiv:2205.07460, 2022
-
[7]
A mutation-based method for multi-modal jailbreaking attack detection,
X. Zhang, C. Zhang, T. Li, Y . Huang, X. Jia, X. Xie, Y . Liu, and C. Shen, “A mutation-based method for multi-modal jailbreaking attack detection,”arXiv preprint arXiv:2312.10766, 2023
-
[8]
Failures to find transferable image jailbreaks between vision-language models,
R. Schaeffer, D. Valentine, L. Bailey, J. Chua, Z. Durante, C. Eyzaguirre, J. Benton, B. Miranda, H. Sleight, T. T. Wanget al., “Failures to find transferable image jailbreaks between vision-language models,” in Workshop on Socially Responsible Language Modelling Research
-
[9]
Fig- 8 Step: Jailbreaking Large Vision-language Models via Typo- graphic Visual Prompts
Y . Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,”arXiv preprint arXiv:2311.05608, 2023
-
[10]
Query-relevant images jailbreak large multi-modal models,
X. Liu, Y . Zhu, Y . Lan, C. Yang, and Y . Qiao, “Query-relevant images jailbreak large multi-modal models,”arXiv preprint arXiv:2311.17600, 2023
-
[11]
Ideator: Jailbreaking and benchmarking large vision- language models using themselves,
R. Wang, J. Li, Y . Wang, B. Wang, X. Wang, Y . Teng, Y . Wang, X. Ma, and Y .-G. Jiang, “Ideator: Jailbreaking and benchmarking large vision- language models using themselves,”arXiv preprint arXiv:2411.00827, 2024
-
[12]
High- resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695
work page 2022
-
[13]
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024
work page 2024
-
[14]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 8748–8763
work page 2021
-
[15]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[18]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,
W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalezet al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,”See https://vicuna. lmsys. org (accessed 14 April 2023), 2023
work page 2023
-
[19]
Instructblip: Towards general-purpose vision- language models with instruction tuning,
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision- language models with instruction tuning,” 2023
work page 2023
-
[20]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,”arXiv preprint arXiv:2301.12597, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
E. Shayegani, M. A. A. Mamun, Y . Fu, P. Zaree, Y . Dong, and N. Abu- Ghazaleh, “Survey of vulnerabilities in large language models revealed by adversarial attacks,”arXiv preprint arXiv:2310.10844, 2023
-
[22]
A survey of attacks on large vision- language models: Resources, advances, and future trends
D. Liu, M. Yang, X. Qu, P. Zhou, W. Hu, and Y . Cheng, “A survey of attacks on large vision-language models: Resources, advances, and future trends,”arXiv preprint arXiv:2407.07403, 2024
-
[23]
Privacy in large language models: Attacks, defenses and future directions,
H. Li, Y . Chen, J. Luo, J. Wang, H. Peng, Y . Kang, X. Zhang, Q. Hu, C. Chan, Z. Xuet al., “Privacy in large language models: Attacks, defenses and future directions,”arXiv preprint arXiv:2310.10383, 2023
-
[24]
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
X. Ma, Y . Gao, Y . Wang, R. Wang, X. Wang, Y . Sun, Y . Ding, H. Xu, Y . Chen, Y . Zhaoet al., “Safety at scale: A comprehensive survey of large model safety,”arXiv preprint arXiv:2502.05206, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
(ab) using images and sounds for indirect instruction injection in multi-modal llms,
E. Bagdasaryan, T.-Y . Hsieh, B. Nassi, and V . Shmatikov, “(ab) using images and sounds for indirect instruction injection in multi-modal llms,” arXiv preprint arXiv:2307.10490, 2023
-
[26]
Image hijacks: Adversarial images can control generative models at runtime
L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adver- sarial images can control generative models at runtime,”arXiv preprint arXiv:2309.00236, 2023
-
[27]
Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,
E. Shayegani, Y . Dong, and N. Abu-Ghazaleh, “Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,” inThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[28]
Are aligned neural networks adversarially aligned?
N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, P. W. W. Koh, D. Ippolito, F. Tramer, and L. Schmidt, “Are aligned neural networks adversarially aligned?”Advances in Neural Information Processing Systems, vol. 36, 2024
work page 2024
-
[29]
Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024
Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, “Jailbreaking attack against multimodal large language model,”arXiv preprint arXiv:2402.02309, 2024
-
[30]
S. Ma, W. Luo, Y . Wang, X. Liu, M. Chen, B. Li, and C. Xiao, “Visual- roleplay: Universal jailbreak attack on multimodal large language mod- els via role-playing image characte,”arXiv preprint arXiv:2405.20773, 2024
-
[31]
Training Diffusion Models with Reinforcement Learning
K. Black, M. Janner, Y . Du, I. Kostrikov, and S. Levine, “Train- ing diffusion models with reinforcement learning,”arXiv preprint arXiv:2305.13301, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Approximately optimal approximate re- inforcement learning,
S. Kakade and J. Langford, “Approximately optimal approximate re- inforcement learning,” inProceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274
work page 2002
-
[33]
Language Models are Few-Shot Learners
T. B. Brown, “Language models are few-shot learners,”arXiv preprint arXiv:2005.14165, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
- [34]
-
[35]
BERTScore: Evaluating Text Generation with BERT
T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,”arXiv preprint arXiv:1904.09675, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[36]
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
S. Gehman, S. Gururangan, M. Sap, Y . Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,”arXiv preprint arXiv:2009.11462, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[37]
Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp,
T. Schick, S. Udupa, and H. Sch ¨utze, “Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp,”Transactions of the Association for Computational Linguistics, vol. 9, pp. 1408–1424, 2021
work page 2021
-
[38]
Robust con- versational agents against imperceptible toxicity triggers,
N. Mehrabi, A. Beirami, F. Morstatter, and A. Galstyan, “Robust con- versational agents against imperceptible toxicity triggers,”arXiv preprint arXiv:2205.02392, 2022
-
[39]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y . Choi, “Clipscore: A reference-free evaluation metric for image captioning,”arXiv preprint arXiv:2104.08718, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[40]
Scaling laws for reward model overoptimization,
L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 10 835–10 866
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.