Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

Abrar Alotaibi; Moataz Ahmed

arxiv: 2606.26566 · v1 · pith:BT3YTTGLnew · submitted 2026-06-25 · 💻 cs.CR · cs.CL

Pith reviewed 2026-06-26 04:33 UTC · model grok-4.3

classification: cs.CR, cs.CL

keywords: adversarial attacks, diffusion models, large language models, vision-language models, survey, taxonomy, defenses, evaluation framework

The pith

Four disconnected adversarial diffusion tracks are fused into a single taxonomy, a threat-model axis, and a five-dimension evaluation framework focused on LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper integrates diffusion-based attacks on text and LLMs, diffusion attacks on image classifiers, jailbreak pipelines against vision-language models, and diffusion-based input purification defenses. These tracks have used separate vocabularies and benchmarks, and the survey supplies a shared conceptual structure with a six-class taxonomy of diffusion roles plus uniform criteria for success rate, transferability, query budget, perplexity, and defense evasion. A sympathetic reader would care because the shared generative mechanism now appears across modalities, so a common language could let researchers compare attacks and defenses that were previously isolated.

Core claim

Diffusion models act as a portable generative component that can be slotted into adversarial pipelines in multiple modalities; cataloging fifty papers under a unified six-class taxonomy of diffusion roles, a threat-model axis that records attacker knowledge, query budget and target accessibility, and a five-dimension evaluation framework produces a coherent dual attacker-defender view centered on the LLM slice.

What carries the argument

Six-class taxonomy of diffusion roles in adversarial pipelines, augmented by a threat-model axis and a five-dimension evaluation framework applied uniformly across modalities.
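
To make the structure concrete, one catalog record could be laid out roughly as follows. This is a minimal sketch, not the layout of the released spreadsheet: the class names, field names, and example values are assumptions made for illustration, and `diffusion_role` is left as a free-form label because the six class names are not listed here.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ThreatModel:
    """Threat-model axis: attacker knowledge, query budget, target accessibility."""
    attacker_knowledge: str       # illustrative values: "white-box", "black-box"
    query_budget: Optional[int]   # None when a paper reports no budget
    target_accessibility: str     # illustrative values: "api", "local-weights"


@dataclass(frozen=True)
class Evaluation:
    """The five evaluation dimensions; None marks an unreported dimension."""
    attack_success_rate: Optional[float] = None
    transferability: Optional[float] = None
    query_budget: Optional[int] = None
    perplexity: Optional[float] = None
    defense_evasion: Optional[float] = None

    def reported(self) -> list[str]:
        """Names of the dimensions this entry actually reports, in field order."""
        return [name for name, value in asdict(self).items() if value is not None]


@dataclass(frozen=True)
class CatalogEntry:
    title: str
    modality: str         # "text/LLM", "image classifier", "vision-language model", "defense"
    diffusion_role: str   # hypothetical placeholder label for one of the six classes
    threat_model: ThreatModel
    evaluation: Evaluation
```

Keeping unreported dimensions as `None` rather than zero lets a reader see at a glance which of the five criteria a given paper leaves out.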

If this is right

Any new attack must be compared against the ten listed non-diffusion baselines.
The four diffusion-based defenses form the natural evaluation backdrop for measuring new attacks.
Five recurring weaknesses in the current LLM-side literature are now identified for targeted follow-up.
The released catalog and spreadsheet support direct reuse of the taxonomy and criteria.
A research agenda of open questions and concrete experimental designs follows from the fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unified criteria could be tested by running the same attack recipe on both an image classifier and an LLM to measure cross-modal transferability.
The narrative review format implies that future work could apply the same taxonomy inside a PRISMA-style systematic review to check coverage.
Focusing the framework on the LLM slice may surface language-specific vulnerabilities, such as perplexity sensitivity, that image-only studies miss.
The dual attacker-defender view suggests experiments that pit each diffusion role directly against each of the four defenses to quantify evasion margins.
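
The last suggestion above can be sketched in a few lines. The paper does not define "evasion margin"; the sketch assumes it means the drop in attack success rate when a defense is switched on, and the function names, role labels, and defense labels are hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that succeeded (True) among all attempts."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def evasion_margins(undefended, defended):
    """Map each (role, defense) pair to the ASR drop caused by the defense.

    undefended: {role: [bool, ...]} outcomes with no defense in place
    defended:   {(role, defense): [bool, ...]} outcomes with the defense active
    A small drop means the attack largely evades that defense (assumed definition).
    """
    margins = {}
    for (role, defense), outcomes in defended.items():
        baseline = attack_success_rate(undefended[role])
        margins[(role, defense)] = baseline - attack_success_rate(outcomes)
    return margins
```

Filling this table for every diffusion role and each of the four defenses would give the role-by-defense grid the bullet describes.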

Load-bearing premise

The four tracks have developed largely disconnected vocabularies, threat models, and benchmarks that can be fused without significant loss of domain-specific detail.

What would settle it

A demonstration that mapping any diffusion role or evaluation dimension from the image-classifier track to the LLM track erases essential threat details that cannot be recovered in the unified taxonomy.

Figures

Figures reproduced from arXiv: 2606.26566 by Abrar Alotaibi, Moataz Ahmed.

**Figure 1.** Taxonomy of cataloged diffusion attacks: target modality (rows) and diffusion role within each modality

**Figure 2.** Cataloged papers by family and year. Color encodes year of publication; family ordering follows the survey’s

**Figure 3.** Mechanistic comparison of the four Family A papers. Section

Original abstract

Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against vision-language models, and diffusion-based input purification defenses. Each has developed its own vocabulary, threat models, and benchmarks, with denoising diffusion models emerging as a shared generative mechanism whose recipes are now actively ported between communities. This survey performs an information-fusion exercise at the meta-research level: we integrate these four tracks into a single conceptual framework with a unified taxonomy, evaluation criteria, and research agenda, focusing on the LLM-side slice. We catalog fifty published papers across four scope areas (text/LLM, image classifier, vision-language model, defense), plus four diffusion-LLM-as-victim entries and ten non-diffusion baselines against which any new attack must be compared. We propose a six-class taxonomy of diffusion roles in adversarial pipelines, augmented by a threat-model axis recording attacker knowledge, query budget, and target accessibility, and apply a five-dimension framework (attack success rate, transferability, query budget, perplexity, defense-evasion) uniformly across modalities. The review adopts a dual attacker-defender perspective: alongside the attack catalog we cover four diffusion-based defenses that form the natural evaluation backdrop for new attacks. Our critical analysis identifies five recurring weaknesses of the current LLM-side literature, and we close with a research agenda of open questions and concrete experimental designs. The companion catalog and spreadsheet are released with the paper. We are explicit that this is a narrative review with quality assessment, not a PRISMA-compliant systematic review, and discuss the implications for replication.
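
Of the five dimensions the abstract names, perplexity has a standard closed form: the exponential of the negative mean log-probability per token. The sketch below assumes natural-log token probabilities obtained from some reference language model; which model the survey uses is not stated here, and the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Lower values indicate more fluent text under the reference model, which is why perplexity is commonly used as a stealth measure for adversarial prompts.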

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This narrative survey offers a six-class taxonomy and five-dimension evaluation framework to fuse diffusion adversarial work across text, vision, and VLMs, but its added value rests on how disconnected those tracks actually were.

The main thing to know is that this paper catalogs fifty works on diffusion-based attacks and defenses, then proposes a unified taxonomy of diffusion roles in adversarial pipelines plus a threat-model axis and five evaluation dimensions applied uniformly across modalities. It also releases a companion catalog and spreadsheet.

What it does well is pull the four tracks (text/LLM attacks, image classifier attacks, VLM jailbreaks, and diffusion defenses) into one place with a dual attacker-defender lens and flag five recurring weaknesses on the LLM side. The research agenda includes concrete experimental designs, which could help someone new to the area get oriented quickly.

The soft spots are straightforward. This is explicitly a narrative review without a described search strategy or inclusion criteria, so the completeness of the fifty-paper set is hard to judge and selection effects are possible. The central fusion claim depends on the tracks having developed largely separate vocabularies and benchmarks; if cross-citations already exist in the cited papers, the unification adds less and risks flattening domain details. Soundness here is really about the accuracy of the catalog rather than any new derivation.

This is for researchers working on adversarial ML with diffusion models, especially the LLM slice, who want a map and pointers rather than new experiments. A reader who needs the spreadsheet or the taxonomy for their own comparisons would get practical value.

It deserves peer review because a careful synthesis with released artifacts can still organize a fragmented corner of the field, even if the framework needs tightening. I would send it to referees rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper is a narrative survey cataloging fifty papers on adversarial diffusion models across four tracks—diffusion-based attacks on text/LLMs, diffusion-based attacks on image classifiers, jailbreak pipelines on vision-language models, and diffusion-based input purification defenses—plus four diffusion-LLM-as-victim entries and ten non-diffusion baselines. It asserts that these tracks developed largely disconnected vocabularies, threat models, and benchmarks, and contributes a six-class taxonomy of diffusion roles, a threat-model axis (attacker knowledge, query budget, target accessibility), a uniform five-dimension evaluation framework (attack success rate, transferability, query budget, perplexity, defense-evasion), a dual attacker-defender perspective covering four defenses, identification of five recurring LLM-side weaknesses, and a research agenda, while releasing a companion catalog and spreadsheet. The work explicitly positions itself as a narrative review rather than PRISMA-compliant.

Significance. If the catalog is accurate and the fusion premise holds, the unified taxonomy and five-dimension framework could provide a useful meta-research lens for cross-modal adversarial work involving diffusion models, with the released catalog and spreadsheet offering concrete value for reproducibility and follow-on studies. The dual perspective and explicit discussion of narrative-review limitations are constructive.

major comments (2)

[Abstract] Abstract: the central claim that the four tracks 'have developed largely disconnected vocabularies, threat models, and benchmarks' is asserted without any quantification of cross-citations, shared formalisms, or overlap analysis; because the paper's contribution is precisely the meta-level fusion, this premise is load-bearing and requires substantiation (e.g., a table or subsection counting cross-track citations among the 50 papers) to establish that unification is non-redundant.
[Abstract] Abstract and opening sections: no search strategy, inclusion/exclusion criteria, or date range is described for selecting the fifty papers (or the additional four and ten entries), even though the text acknowledges the narrative-review nature; this directly affects the reliability of the catalog and the claim of a 'single conceptual framework' that integrates the space without significant loss of domain-specific detail.
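
The overlap analysis requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: the inputs (a paper-to-track mapping and a list of citing/cited pairs) and both function names are hypothetical, and nothing here reflects data from the paper's catalog.

```python
from collections import Counter


def cross_track_citation_counts(track_of, citations):
    """Count in-catalog citations keyed by (citing track, cited track).

    track_of:  {paper_id: track}
    citations: iterable of (citing_id, cited_id) pairs
    Pairs that involve a paper outside the catalog are ignored.
    """
    counts = Counter()
    for citing, cited in citations:
        if citing in track_of and cited in track_of:
            counts[(track_of[citing], track_of[cited])] += 1
    return counts


def cross_track_fraction(counts):
    """Share of in-catalog citations that cross a track boundary."""
    total = sum(counts.values())
    crossing = sum(n for (src, dst), n in counts.items() if src != dst)
    return crossing / total if total else 0.0
```

A low cross-track fraction would support the "largely disconnected" premise; a high one would suggest the tracks already borrow from each other.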

minor comments (1)

[Abstract] The abstract states the work 'focuses on the LLM-side slice' while covering all four tracks; a brief clarification in the introduction on how modality-specific details are preserved versus abstracted in the unified taxonomy would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract and framing. We address each point below and will incorporate revisions to improve transparency and substantiation.

Referee: [Abstract] Abstract: the central claim that the four tracks 'have developed largely disconnected vocabularies, threat models, and benchmarks' is asserted without any quantification of cross-citations, shared formalisms, or overlap analysis; because the paper's contribution is precisely the meta-level fusion, this premise is load-bearing and requires substantiation (e.g., a table or subsection counting cross-track citations among the 50 papers) to establish that unification is non-redundant.

Authors: We agree the claim would be stronger with explicit support. In revision we will add a short subsection (and accompanying table) that reports our observed citation patterns across the 50 papers, including counts of cross-track references and shared formalisms where they exist. This analysis draws directly from the papers already catalogued; while we will not perform a full bibliometric study, the added table will quantify the limited overlap that motivated the fusion exercise. revision: yes

Referee: [Abstract] Abstract and opening sections: no search strategy, inclusion/exclusion criteria, or date range is described for selecting the fifty papers (or the additional four and ten entries), even though the text acknowledges the narrative-review nature; this directly affects the reliability of the catalog and the claim of a 'single conceptual framework' that integrates the space without significant loss of domain-specific detail.

Authors: We accept the point on transparency. Although the manuscript already states it is a narrative rather than PRISMA review, we will expand the opening section to describe the practical selection process used: approximate date range (papers up to mid-2024), relevance criteria for each of the four tracks, and how the additional baseline entries were chosen. This addition will clarify scope without converting the review into a systematic one. revision: yes

Circularity Check

0 steps flagged

No circularity: narrative survey with no derivations or self-referential claims

The paper is a narrative review cataloging 50+ existing works into a proposed taxonomy and evaluation framework. No equations, predictions, fitted parameters, or derivations appear. The premise of 'largely disconnected tracks' is an empirical observation about the literature rather than a self-defined or self-cited load-bearing step that reduces to the paper's own inputs. The contribution is organizational; the taxonomy is not claimed to be mathematically forced or derived from prior author work. This matches the default expectation of no significant circularity for survey-style papers without quantitative modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey without new derivations; no free parameters, axioms, or invented entities are introduced.


Reference graph

Works this paper leans on

83 extracted references · 75 canonical work pages · 22 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al., 2023. GPT-4 technical report. doi:10.48550/arXiv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[2]

Structureddenoisingdiffusionmodelsindiscretestate-spaces,in:Advances in Neural Information Processing Systems

Austin,J.,Johnson,D.,Ho,J.,Tarlow,D.,vandenBerg,R.,2021. Structureddenoisingdiffusionmodelsindiscretestate-spaces,in:Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2107.03006

work page doi:10.48550/arxiv.2107.03006 2021
[3]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramèr, F., Hassani, H., Wong, E., 2024. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318 doi:10.48550/arXiv.2404.01318

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.01318 2024
[4]

Jailbreaking Black Box Large Language Models in Twenty Queries

Chao,P.,Robey,A.,Dobriban,E.,Hassani,H.,Pappas,G.J.,Wong,E.,2023. Jailbreakingblackboxlargelanguagemodelsintwentyqueries. arXiv preprint arXiv:2310.08419 doi:10.48550/arXiv.2310.08419

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.08419 2023
[5]

Diffusion models for imperceptible and transferable adversarial attack

Chen, J., Chen, H., Chen, K., Zhang, Y., Zou, Z., Shi, Z., 2024. Diffusion models for imperceptible and transferable adversarial attack. IEEE Transactions on Pattern Analysis and Machine Intelligence doi:10.1109/TPAMI.2024.3372023

work page doi:10.1109/tpami.2024.3372023 2024
[6]

Chen,X.,Gao,X.,Zhao,J.,Ye,K.,Xu,C.Z.,2023a.AdvDiffuser:Naturaladversarialexamplesynthesiswithdiffusionmodels,in:Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
[7]

Natural adversarial patch generation method based on latent diffusion model

Chen, X., Liu, F., Jiang, D., Yan, K., 2023b. Natural adversarial patch generation method based on latent diffusion model. arXiv preprint arXiv:2312.16401 doi:10.48550/arXiv.2312.16401

work page doi:10.48550/arxiv.2312.16401
[8]

Content-based unrestricted adversarial attack, in: Advances in Neural Information Processing Systems

Chen, Z., Li, B., Wu, S., Ding, S., Zhang, W., 2023c. Content-based unrestricted adversarial attack, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2305.10665

work page doi:10.48550/arxiv.2305.10665
[9]

NatADiff: Adversarial boundary guidance for natural adversarial diffusion

Collins, M., Vice, J., French, T., Mian, A., 2025. NatADiff: Adversarial boundary guidance for natural adversarial diffusion. arXiv preprint arXiv:2505.20934 doi:10.48550/arXiv.2505.20934

work page doi:10.48550/arxiv.2505.20934 2025
[10]

Risk taxonomy, mitigation, and assessment benchmarks of large language model systems

Cui, T., Wang, Y., Fu, C., Xiao, Y., Li, S., Deng, X., Liu, Y., Zhang, Q., Qiu, Z., Li, P., Tan, Z., Xiong, J., Kong, X., Wen, Z., Xu, K., Li, Q., 2024. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems. arXiv preprint arXiv:2401.05778 doi:10.48550/arXiv.2401.05778

work page doi:10.48550/arxiv.2401.05778 2024
[11]

AdvDiff: Generating unrestricted adversarial examples using diffusion models, in: European Conference on Computer Vision (ECCV)

Dai, X., Liang, K., Xiao, B., 2024. AdvDiff: Generating unrestricted adversarial examples using diffusion models, in: European Conference on Computer Vision (ECCV). doi:10.48550/arXiv.2307.12499

work page doi:10.48550/arxiv.2307.12499 2024
[12]

SemDiff: Generating natural unrestricted adversarial examples via semantic attributes optimization in diffusion models

Dai, Z., Liu, S., He, R., Wu, J., Lu, N., Fan, W., Li, Q., Tang, K., 2025. SemDiff: Generating natural unrestricted adversarial examples via semantic attributes optimization in diffusion models. arXiv preprint arXiv:2504.11923 doi:10.48550/arXiv.2504.11923

work page doi:10.48550/arxiv.2504.11923 2025
[13]

ImageNet:Alarge-scalehierarchicalimagedatabase,in:Proceedingsofthe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Deng,J.,Dong,W.,Socher,R.,Li,L.J.,Li,K.,Fei-Fei,L.,2009. ImageNet:Alarge-scalehierarchicalimagedatabase,in:Proceedingsofthe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2009
[14]

Diffusion Models Beat GANs on Image Synthesis

Dhariwal, P., Nichol, A.Q., 2021. Diffusion models beat GANs on image synthesis, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2105.05233

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2105.05233 2021
[15]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al., 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, in: arXiv preprint arXiv:2209.07858. doi:10.48550/arXiv.2209.07858

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.07858 2022
[16]

, booktitle =

Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A., 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models, in: Findings of the Association for Computational Linguistics: EMNLP. doi:10.18653/v1/2020.findings-emnlp.301

work page doi:10.18653/v1/2020.findings-emnlp.301 2020
[17]

MaskPure:Improvingdefenseagainsttextadversarieswithstochasticpurification,in:NaturalLanguageProcessing and Information Systems (NLDB)

Gietz,H.,Kalita,J.,2024. MaskPure:Improvingdefenseagainsttextadversarieswithstochasticpurification,in:NaturalLanguageProcessing and Information Systems (NLDB). doi:10.1007/978-3-031-70239-6_26

work page doi:10.1007/978-3-031-70239-6_26 2024
[18]

FigStep: Jailbreaking large vision-language models via typographic visual prompts, in: Proceedings of the AAAI Conference on Artificial Intelligence

Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., Duan, S., Wang, X., 2025. FigStep: Jailbreaking large vision-language models via typographic visual prompts, in: Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.48550/arXiv.2311.05608

work page doi:10.48550/arxiv.2311.05608 2025
[19]

Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models

Guo, Q., Pang, S., Jia, X., Liu, Y., Guo, Q., 2024. Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models. IEEE Transactions on Information Forensics and Security doi:10.1109/TIFS.2024.3518072

work page doi:10.1109/tifs.2024.3518072 2024
[20]

Denoising Diffusion Probabilistic Models

Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2006.11239

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2006.11239 2020
[21]

ScoreAdv: Score-based targeted generation of natural adversarial examples via diffusion models

Huang, C., Tang, H., 2025. ScoreAdv: Score-based targeted generation of natural adversarial examples via diffusion models. arXiv preprint arXiv:2507.06078 doi:10.48550/arXiv.2507.06078

work page doi:10.48550/arxiv.2507.06078 2025
[22]

A survey of safety and trustworthiness of large language models through the lens of verification and validation

Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., Bensalem, S., Mu, R., Qi, Y., Zhao, X., Cai, K., Zhang, Y., Wu, S., Xu, P., Wu, D., Freitas, A., Mustafa, M.A., 2024. A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artificial Intelligence Review doi:10.1007/s10462-024-10884-2

work page doi:10.1007/s10462-024-10884-2 2024
[23]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan,H.,Upasani,K.,Chi,J.,Rungta,R.,Iyer,K.,Mao,Y.,Tontchev,M.,Hu,Q.,Fuller,B.,Testuggine,D.,Khabsa,M.,2023. Llamaguard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 doi:10.48550/arXiv.2312.06674

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.06674 2023
[24]

Redteaminglargelanguagemodels:Acomprehensivereviewandcriticalanalysis

Jabbar,M.S.,Al-Azani,S.,Alotaibi,A.,Ahmed,M.,2025. Redteaminglargelanguagemodels:Acomprehensivereviewandcriticalanalysis. Preprint submitted to Elsevier

2025
[25]

Enhancing diffusion-based unrestricted adversarial attacks via adversary preferences alignment

Jiang, K., Chen, Z., Guo, H., Li, J., Fu, J., Guo, P., Tang, H., Li, B., Zhang, W., 2025. Enhancing diffusion-based unrestricted adversarial attacks via adversary preferences alignment. arXiv preprint arXiv:2506.01511 doi:10.48550/arXiv.2506.01511

work page doi:10.48550/arxiv.2506.01511 2025
[26]

Is BERT really robust? a strong baseline for natural language attack on text classification and entailment (TextFooler)

Jin, D., Jin, Z., Zhou, J.T., Szolovits, P., 2020. Is BERT really robust? a strong baseline for natural language attack on text classification and entailment (TextFooler). Proceedings of the AAAI Conference on Artificial Intelligence doi:10.1609/aaai.v34i05.6311. A. Alotaibi and M. Ahmed:Preprint submitted to ElsevierPage 28 of 30 Adversarial Diffusion Ac...

work page doi:10.1609/aaai.v34i05.6311 2020
[27]

DiffAttack: Evasion attacks against diffusion-based adversarial purification, in: Advances in Neural Information Processing Systems

Kang, M., Song, D., Li, B., 2024. DiffAttack: Evasion attacks against diffusion-based adversarial purification, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2311.16124

work page doi:10.48550/arxiv.2311.16124 2024
[28]

The measurement of observer agreement for categorical data

Landis, J.R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 33, 159–174

1977
[29]

Textadversarialpurificationasdefenseagainstadversarialattacks,in:Proceedingsofthe61stAnnualMeeting of the Association for Computational Linguistics (ACL), pp

Li,L.,Song,D.,Qiu,X.,2023. Textadversarialpurificationasdefenseagainstadversarialattacks,in:Proceedingsofthe61stAnnualMeeting of the Association for Computational Linguistics (ACL), pp. 338–350. doi:10.18653/v1/2023.acl-long.20

work page doi:10.18653/v1/2023.acl-long.20 2023
[30]

DiffAttack-X: An effective transferable adversarial attack based on diffusion models

Li, L., Zhang, X., Wang, J., et al., 2025a. DiffAttack-X: An effective transferable adversarial attack based on diffusion models. Applied Intelligence 55, 1062. doi:10.1007/s10489-025-06957-6

work page doi:10.1007/s10489-025-06957-6
[31]

Imagesareachilles’heelofalignment:Exploitingvisualvulnerabilitiesforjailbreaking multimodal large language models, in: European Conference on Computer Vision (ECCV)

Li,Y.,Guo,H.,Zhou,K.,Zhao,W.X.,Wen,J.R.,2024. Imagesareachilles’heelofalignment:Exploitingvisualvulnerabilitiesforjailbreaking multimodal large language models, in: European Conference on Computer Vision (ECCV). doi:10.48550/arXiv.2403.09792

work page doi:10.48550/arxiv.2403.09792 2024
[32]

doi:10.48550/arXiv

Li, Z., Nie, Z., Zhou, Z., Liu, Y., Zhang, Y., Cheng, Y., Wen, Q., Wang, K., Guo, Y., Zhang, J., 2026. DiffuGuard: How intrinsic safety is lostandfoundindiffusionlargelanguagemodels,in:InternationalConferenceonLearningRepresentations(ICLR). doi:10.48550/arXiv. 2509.24296

work page internal anchor Pith review doi:10.48550/arxiv 2026
[33]

DiffuseDef: Improved robustness to adversarial attacks via iterative denoising, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)

Li, Z., Zhou, H., Rei, M., Specia, L., 2025b. DiffuseDef: Improved robustness to adversarial attacks via iterative denoising, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). doi:10.48550/arXiv.2407.00248

work page doi:10.48550/arxiv.2407.00248
[34]

Againsttheachilles’ heel: A survey on red teaming for generative models

Lin,L.,Mu,H.,Zhai,Z.,Wang,M.,Wang,Y.,Wang,R.,Gao,J.,Zhang,Y.,Che,W.,Baldwin,T.,Han,X.,Li,H.,2025. Againsttheachilles’ heel: A survey on red teaming for generative models. Journal of Artificial Intelligence Research doi:10.48550/arXiv.2404.00629

work page doi:10.48550/arxiv.2404.00629 2025
[35]

Microsoft COCO: Common objects in context, in: European Conference on Computer Vision (ECCV)

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: European Conference on Computer Vision (ECCV)

2014
[36]

Adv-diffusion:Imperceptibleadversarialfaceidentityattackvialatentdiffusion model, in: Proceedings of the AAAI Conference on Artificial Intelligence

Liu,D.,Wang,X.,Peng,C.,Wang,N.,Hu,R.,Gao,X.,2024a. Adv-diffusion:Imperceptibleadversarialfaceidentityattackvialatentdiffusion model, in: Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v38i4.28067

work page doi:10.1609/aaai.v38i4.28067
[37]

DiffProtect: Generate adversarial examples with diffusion models for facial privacy protection

Liu, J., Lau, C.P., Chellappa, R., 2023. DiffProtect: Generate adversarial examples with diffusion models for facial privacy protection. arXiv preprint arXiv:2305.13625 doi:10.48550/arXiv.2305.13625

work page doi:10.48550/arxiv.2305.13625 2023
[38]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Liu, X., Xu, N., Chen, M., Xiao, C., 2024b. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2310.04451

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.04451
[39]

MM-SafetyBench:Abenchmarkforsafetyevaluationofmultimodallargelanguage models, in: European Conference on Computer Vision (ECCV)

Liu,X.,Zhu,Y.,Gu,J.,Lan,Y.,Yang,C.,Qiao,Y.,2024c. MM-SafetyBench:Abenchmarkforsafetyevaluationofmultimodallargelanguage models, in: European Conference on Computer Vision (ECCV). doi:10.48550/arXiv.2311.17600

work page doi:10.48550/arxiv.2311.17600
[40]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., Liu, Y., 2024d. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 doi:10.48550/arXiv.2305.13860

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.13860
[41]

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Lou, A., Meng, C., Ermon, S., 2024. Discrete diffusion modeling by estimating the ratios of the data distribution (SEDD), in: International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2310.16834

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.16834 2024
[42]

DiffusionLLMsarenaturaladversariesforanyLLM

Lüdke,D.,Wollschläger,T.,Ungermann,P.,Günnemann,S.,Schwinn,L.,2025. DiffusionLLMsarenaturaladversariesforanyLLM. arXiv preprint arXiv:2511.00203 doi:10.48550/arXiv.2511.00203

work page doi:10.48550/arxiv.2511.00203 2025
[43]

Visual-roleplay:Universaljailbreakattackonmultimodallargelanguage models via role-playing image character

Ma,S.,Luo,W.,Wang,Y.,Liu,X.,Chen,M.,Li,B.,Xiao,C.,2024. Visual-roleplay:Universaljailbreakattackonmultimodallargelanguage models via role-playing image character. arXiv preprint arXiv:2405.20773 doi:10.48550/arXiv.2405.20773

work page doi:10.48550/arxiv.2405.20773 2024
[44]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika,M.,Phan,L.,Yin,X.,Zou,A.,Wang,Z.,Mu,N.,Sakhaee,E.,Li,N.,Basart,S.,Li,B.,Forsyth,D.,Hendrycks,D.,2024.HarmBench: Astandardizedevaluationframeworkforautomatedredteamingandrobustrefusal,in:InternationalConferenceonMachineLearning(ICML). doi:10.48550/arXiv.2402.04249

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.04249 2024
[45]

FLIRT:Feedbackloopin-context red teaming

Mehrabi,N.,Goyal,P.,Dupuy,C.,Hu,Q.,Ghosh,S.,Zemel,R.,Chang,K.W.,Galstyan,A.,Gupta,R.,2023. FLIRT:Feedbackloopin-context red teaming. arXiv preprint arXiv:2308.04265

arXiv 2023
[46]

AdvLogo: Adversarial patch attack against object detectors based on diffusion models

Miao, B., Li, C., Zhu, Y., Sun, W., Wang, Z., Wang, X., Xie, C., 2024. AdvLogo: Adversarial patch attack against object detectors based on diffusion models. arXiv preprint arXiv:2409.07002 doi:10.48550/arXiv.2409.07002

work page doi:10.48550/arxiv.2409.07002 2024
[47]

Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M., 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, in: International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2112.10741
[48]

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.R., Li, C., 2025. Large language diffusion models (LLaDA). arXiv preprint arXiv:2502.09992 doi:10.48550/arXiv.2502.09992
[49]

Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., Anandkumar, A., 2022. Diffusion models for adversarial purification, in: Proceedings of the 39th International Conference on Machine Learning (ICML)
[50]

Nöther, J., Singla, A., Radanović, G., 2025. Text-diffusion red-teaming of large language models: Unveiling harmful behaviors with proximity constraints, in: Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.48550/arXiv.2501.08246
[51]

Page, M.J., McKenzie, J.E., Bossuyt, P.M., Boutron, I., Hoffmann, T.C., Mulrow, C.D., Shamseer, L., Tetzlaff, J.M., Akl, E.A., Brennan, S.E., et al., 2021. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372. doi:10.1136/bmj.n71
[52]

Paulus, A., Zharmagambetov, A., Guo, C., Amos, B., Tian, Y., 2024. AdvPrompter: Fast adaptive adversarial prompting for LLMs. arXiv preprint arXiv:2404.16873 doi:10.48550/arXiv.2404.16873
[53]

Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G., 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 doi:10.48550/arXiv.2202.03286
[54]

Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P., 2024. Visual adversarial examples jailbreak aligned large language models, in: Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.48550/arXiv.2306.13213
[55]

Qiu, S., Gou, M., Liang, T., 2024. Diffusion model for adversarial attack against NLP models, in: Proceedings of the SPIE Vol. 13105 (ICCAID 2023). doi:10.1117/12.3026312
[56]

Rando, J., Korevaar, H., Brinkman, E., Evtimov, I., Tramèr, F., 2024. Gradient-based jailbreak images for multimodal fusion models. arXiv preprint arXiv:2410.03489 doi:10.48550/arXiv.2410.03489
[57]

Robey, A., Wong, E., Hassani, H., Pappas, G.J., 2023. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684 doi:10.48550/arXiv.2310.03684
[58]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.48550/arXiv.2112.10752
[59]

Shayegani, E., Dong, Y., Abu-Ghazaleh, N., 2024. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2307.14539
[60]

Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M., 2024. Simplified and generalized masked diffusion for discrete data (MDLM). arXiv preprint arXiv:2406.04329 doi:10.48550/arXiv.2406.04329
[61]

Song, J., Meng, C., Ermon, S., 2021a. Denoising diffusion implicit models, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2010.02502
[62]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B., 2021b. Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2011.13456
[63]

Sun, Y., Yu, L., Xie, H., Li, J., Zhang, Y., 2024. DiffAM: Diffusion-based adversarial makeup transfer for facial privacy protection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.48550/arXiv.2405.09882
[64]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al., 2023. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 doi:10.48550/arXiv.2307.09288
[65]

Wang, H., Li, H., Zhu, J., Wang, X., Pan, C., Huang, M., Sha, L., 2024a. DiffusionAttacker: Diffusion-driven prompt manipulation for LLM jailbreak. arXiv preprint arXiv:2412.17522 doi:10.48550/arXiv.2412.17522
[66]

Wang, R., Li, J., Wang, Y., Wang, B., Wang, X., Teng, Y., Wang, Y., Ma, X., Jiang, Y.G., 2025a. IDEATOR: Jailbreaking and benchmarking large vision-language models using themselves, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). doi:10.48550/arXiv.2411.00827
[67]

Wang, R., Ma, X., Zhou, H., Ji, C., Ye, G., Jiang, Y.G., 2024b. White-box multimodal jailbreaks against large vision-language models, in: Proceedings of the 32nd ACM International Conference on Multimedia (MM). doi:10.48550/arXiv.2405.17894
[68]

Wang, W., Qi, H., Huang, Z., Yin, B., et al., 2025b. Latent-space diffusion models for stealthy and transferable adversarial attacks on object detection. Neurocomputing 656, 131456. doi:10.1016/j.neucom.2025.131456
[69]

Wang, X., Ji, Z., Ma, P., Li, Z., Wang, S., 2023. InstructTA: Instruction-tuned targeted attack for large vision-language models. arXiv preprint arXiv:2312.01886 doi:10.48550/arXiv.2312.01886
[70]

Wang, Z., Ma, X., Jiang, Y.G., 2024c. BadPatch: Diffusion-based generation of physical adversarial patches. arXiv preprint arXiv:2412.01440 doi:10.48550/arXiv.2412.01440
[71]

Wei, A., Haghtalab, N., Steinhardt, J., 2023. Jailbroken: How does LLM safety training fail?, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2307.02483
[72]

Wen, Z., Qu, J., Chen, Z., Lu, X., Liu, D., Liu, Z., Wu, R., Yang, Y., Jin, X., Xu, H., Liu, X., Li, W., Lu, C., Shao, J., He, C., Zhang, L., 2026. The devil behind the mask: An emergent safety vulnerability of diffusion LLMs, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2507.11097
[73]

Xu, W., Chen, K., Gao, Z., Wei, Z., Chen, J., Jiang, Y.G., 2024. Highly transferable diffusion-based unrestricted adversarial attack on pre-trained vision-language models, in: Proceedings of the 32nd ACM International Conference on Multimedia (MM). doi:10.1145/3664647.3681538
[74]

Xue, H., Araujo, A., Hu, B., Chen, Y., 2023. Diff-PGD: Diffusion-based adversarial sample generation for improved stealthiness and controllability, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2305.16494
[75]

Yamabe, S., Sakuma, J., 2026. Toward safer diffusion language models: Discovery and mitigation of priming vulnerability, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2510.00565
[76]

Yi, S., Liu, Y., Sun, Z., Cong, T., He, X., Song, J., Xu, K., Li, Q., 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 doi:10.48550/arXiv.2407.04295
[77]

Ying, Z., Liu, A., Zhang, T., Yu, Z., Liang, S., Liu, X., Tao, D., 2024. Jailbreak vision language models via bi-modal adversarial prompt, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2406.04031
[78]

Zeng, Y., Cao, Y., Cao, B., Chang, Y., Chen, J., Lin, L., 2025. AdvI2I: Adversarial image attack on image-to-image diffusion models, in: International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2410.21471
[79]

Zhang, J., Ye, J., Ma, X., Li, Y., Yang, Y., Chen, Y., Sang, J., Yeung, D.Y., 2025a. AnyAttack: Towards large-scale self-supervised adversarial attacks on vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.48550/arXiv.2410.05346
[80]

Zhang, Y., Xie, F., Zhou, Z., Li, Z., Chen, H., Wang, K., Guo, Y., 2025b. Jailbreaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation. arXiv preprint arXiv:2507.19227 doi:10.48550/arXiv.2507.19227


[41] [41]

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Lou, A., Meng, C., Ermon, S., 2024. Discrete diffusion modeling by estimating the ratios of the data distribution (SEDD), in: International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2310.16834

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.16834 2024

[42] [42]

DiffusionLLMsarenaturaladversariesforanyLLM

Lüdke,D.,Wollschläger,T.,Ungermann,P.,Günnemann,S.,Schwinn,L.,2025. DiffusionLLMsarenaturaladversariesforanyLLM. arXiv preprint arXiv:2511.00203 doi:10.48550/arXiv.2511.00203

work page doi:10.48550/arxiv.2511.00203 2025

[43] [43]

Visual-roleplay:Universaljailbreakattackonmultimodallargelanguage models via role-playing image character

Ma,S.,Luo,W.,Wang,Y.,Liu,X.,Chen,M.,Li,B.,Xiao,C.,2024. Visual-roleplay:Universaljailbreakattackonmultimodallargelanguage models via role-playing image character. arXiv preprint arXiv:2405.20773 doi:10.48550/arXiv.2405.20773

work page doi:10.48550/arxiv.2405.20773 2024

[44] [44]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika,M.,Phan,L.,Yin,X.,Zou,A.,Wang,Z.,Mu,N.,Sakhaee,E.,Li,N.,Basart,S.,Li,B.,Forsyth,D.,Hendrycks,D.,2024.HarmBench: Astandardizedevaluationframeworkforautomatedredteamingandrobustrefusal,in:InternationalConferenceonMachineLearning(ICML). doi:10.48550/arXiv.2402.04249

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.04249 2024

[45] [45]

FLIRT:Feedbackloopin-context red teaming

Mehrabi,N.,Goyal,P.,Dupuy,C.,Hu,Q.,Ghosh,S.,Zemel,R.,Chang,K.W.,Galstyan,A.,Gupta,R.,2023. FLIRT:Feedbackloopin-context red teaming. arXiv preprint arXiv:2308.04265

arXiv 2023

[46] [46]

AdvLogo: Adversarial patch attack against object detectors based on diffusion models

Miao, B., Li, C., Zhu, Y., Sun, W., Wang, Z., Wang, X., Xie, C., 2024. AdvLogo: Adversarial patch attack against object detectors based on diffusion models. arXiv preprint arXiv:2409.07002 doi:10.48550/arXiv.2409.07002

work page doi:10.48550/arxiv.2409.07002 2024

[47] [47]

GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, in: International Conference on Machine Learning (ICML)

Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M., 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, in: International Conference on Machine Learning (ICML). doi:10.48550/ arXiv.2112.10741

Pith/arXiv arXiv 2022

[48] [48]

Large Language Diffusion Models

Nie,S.,Zhu,F.,You,Z.,Zhang,X.,Ou,J.,Hu,J.,Zhou,J.,Lin,Y.,Wen,J.R.,Li,C.,2025. Largelanguagediffusionmodels(LLaDA). arXiv preprint arXiv:2502.09992 doi:10.48550/arXiv.2502.09992

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.09992 2025

[49] [49]

Diffusion models for adversarial purification, in: Proceedings of the 39th International Conference on Machine Learning (ICML)

Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., Anandkumar, A., 2022. Diffusion models for adversarial purification, in: Proceedings of the 39th International Conference on Machine Learning (ICML)

2022

[50] [50]

Text-diffusionred-teamingoflargelanguagemodels:Unveilingharmfulbehaviorswithproximity constraints, in: Proceedings of the AAAI Conference on Artificial Intelligence

Nöther,J.,Singla,A.,Radanović,G.,2025. Text-diffusionred-teamingoflargelanguagemodels:Unveilingharmfulbehaviorswithproximity constraints, in: Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.48550/arXiv.2501.08246

work page doi:10.48550/arxiv.2501.08246 2025

[51] [51]

The 18 PRISMA 2020 statement: An updated guideline for reporting systematic reviews

Page,M.J.,McKenzie,J.E.,Bossuyt,P.M.,Boutron,I.,Hoffmann,T.C.,Mulrow,C.D.,Shamseer,L.,Tetzlaff,J.M.,Akl,E.A.,Brennan,S.E., et al., 2021. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372. doi:10.1136/bmj.n71

work page doi:10.1136/bmj.n71 2021

[52]

AdvPrompter: Fast adaptive adversarial prompting for LLMs

Paulus, A., Zharmagambetov, A., Guo, C., Amos, B., Tian, Y., 2024. AdvPrompter: Fast adaptive adversarial prompting for LLMs. arXiv preprint arXiv:2404.16873 doi:10.48550/arXiv.2404.16873

work page doi:10.48550/arxiv.2404.16873 2024

[53]

Red Teaming Language Models with Language Models

Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G., 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 doi:10.48550/arXiv.2202.03286

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.03286 2022

[54]

Visual adversarial examples jailbreak aligned large language models, in: Proceedings of the AAAI Conference on Artificial Intelligence

Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P., 2024. Visual adversarial examples jailbreak aligned large language models, in: Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.48550/arXiv.2306.13213

work page doi:10.48550/arxiv.2306.13213 2024

[55]

Diffusion model for adversarial attack against NLP models, in: Proceedings of the SPIE Vol. 13105 (ICCAID 2023)

Qiu, S., Gou, M., Liang, T., 2024. Diffusion model for adversarial attack against NLP models, in: Proceedings of the SPIE Vol. 13105 (ICCAID 2023). doi:10.1117/12.3026312

work page doi:10.1117/12.3026312 2024

[56]

Gradient-based jailbreak images for multimodal fusion models

Rando, J., Korevaar, H., Brinkman, E., Evtimov, I., Tramèr, F., 2024. Gradient-based jailbreak images for multimodal fusion models. arXiv preprint arXiv:2410.03489 doi:10.48550/arXiv.2410.03489

work page doi:10.48550/arxiv.2410.03489 2024

[57]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Robey, A., Wong, E., Hassani, H., Pappas, G.J., 2023. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684 doi:10.48550/arXiv.2310.03684

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.03684 2023

[58]

High-Resolution Image Synthesis with Latent Diffusion Models

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.48550/arXiv.2112.10752

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.10752 2022

[59]

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models, in: International Conference on Learning Representations (ICLR)

Shayegani, E., Dong, Y., Abu-Ghazaleh, N., 2024. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2307.14539

work page doi:10.48550/arxiv.2307.14539 2024

[60]

Simplified and generalized masked diffusion for discrete data (MDLM)

Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M., 2024. Simplified and generalized masked diffusion for discrete data (MDLM). arXiv preprint arXiv:2406.04329 doi:10.48550/arXiv.2406.04329

work page doi:10.48550/arxiv.2406.04329 2024

[61]

Denoising Diffusion Implicit Models

Song, J., Meng, C., Ermon, S., 2021a. Denoising diffusion implicit models, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2010.02502

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.02502 2021

[62]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B., 2021b. Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2011.13456

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2011.13456 2021

[63]

DiffAM: Diffusion-based adversarial makeup transfer for facial privacy protection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Sun, Y., Yu, L., Xie, H., Li, J., Zhang, Y., 2024. DiffAM: Diffusion-based adversarial makeup transfer for facial privacy protection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.48550/arXiv.2405.09882

work page doi:10.48550/arxiv.2405.09882 2024

[64]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al., 2023. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 doi:10.48550/arXiv.2307.09288

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023

[65]

DiffusionAttacker: Diffusion-driven prompt manipulation for LLM jailbreak

Wang, H., Li, H., Zhu, J., Wang, X., Pan, C., Huang, M., Sha, L., 2024a. DiffusionAttacker: Diffusion-driven prompt manipulation for LLM jailbreak. arXiv preprint arXiv:2412.17522 doi:10.48550/arXiv.2412.17522

work page doi:10.48550/arxiv.2412.17522

[66]

IDEATOR: Jailbreaking and benchmarking large vision-language models using themselves, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Wang, R., Li, J., Wang, Y., Wang, B., Wang, X., Teng, Y., Wang, Y., Ma, X., Jiang, Y.G., 2025a. IDEATOR: Jailbreaking and benchmarking large vision-language models using themselves, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). doi:10.48550/arXiv.2411.00827

work page doi:10.48550/arxiv.2411.00827

[67]

White-box multimodal jailbreaks against large vision-language models, in: Proceedings of the 32nd ACM International Conference on Multimedia (MM)

Wang, R., Ma, X., Zhou, H., Ji, C., Ye, G., Jiang, Y.G., 2024b. White-box multimodal jailbreaks against large vision-language models, in: Proceedings of the 32nd ACM International Conference on Multimedia (MM). doi:10.48550/arXiv.2405.17894

work page doi:10.48550/arxiv.2405.17894

[68]

Latent-space diffusion models for stealthy and transferable adversarial attacks on object detection

Wang, W., Qi, H., Huang, Z., Yin, B., et al., 2025b. Latent-space diffusion models for stealthy and transferable adversarial attacks on object detection. Neurocomputing 656, 131456. doi:10.1016/j.neucom.2025.131456

work page doi:10.1016/j.neucom.2025.131456 2025

[69]

InstructTA: Instruction-tuned targeted attack for large vision-language models

Wang, X., Ji, Z., Ma, P., Li, Z., Wang, S., 2023. InstructTA: Instruction-tuned targeted attack for large vision-language models. arXiv preprint arXiv:2312.01886 doi:10.48550/arXiv.2312.01886

work page doi:10.48550/arxiv.2312.01886 2023

[70]

BadPatch: Diffusion-based generation of physical adversarial patches

Wang, Z., Ma, X., Jiang, Y.G., 2024c. BadPatch: Diffusion-based generation of physical adversarial patches. arXiv preprint arXiv:2412.01440 doi:10.48550/arXiv.2412.01440

work page doi:10.48550/arxiv.2412.01440

[71]

Jailbroken: How Does LLM Safety Training Fail?

Wei, A., Haghtalab, N., Steinhardt, J., 2023. Jailbroken: How does LLM safety training fail?, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2307.02483

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.02483 2023

[72]

The devil behind the mask: An emergent safety vulnerability of diffusion LLMs, in: International Conference on Learning Representations (ICLR)

Wen, Z., Qu, J., Chen, Z., Lu, X., Liu, D., Liu, Z., Wu, R., Yang, Y., Jin, X., Xu, H., Liu, X., Li, W., Lu, C., Shao, J., He, C., Zhang, L., 2026. The devil behind the mask: An emergent safety vulnerability of diffusion LLMs, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2507.11097

work page doi:10.48550/arxiv.2507.11097 2026

[73]

Xu, W., Chen, K., Gao, Z., Wei, Z., Chen, J., Jiang, Y.G., 2024. Highly transferable diffusion-based unrestricted adversarial attack on pre-trained vision-language models, in: Proceedings of the 32nd ACM International Conference on Multimedia (MM). doi:10.1145/3664647.3681538

work page doi:10.1145/3664647.3681538 2024

[74]

Diff-PGD: Diffusion-based adversarial sample generation for improved stealthiness and controllability, in: Advances in Neural Information Processing Systems

Xue, H., Araujo, A., Hu, B., Chen, Y., 2023. Diff-PGD: Diffusion-based adversarial sample generation for improved stealthiness and controllability, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2305.16494

work page doi:10.48550/arxiv.2305.16494 2023

[75]

Toward safer diffusion language models: Discovery and mitigation of priming vulnerability, in: International Conference on Learning Representations (ICLR)

Yamabe, S., Sakuma, J., 2026. Toward safer diffusion language models: Discovery and mitigation of priming vulnerability, in: International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2510.00565

work page doi:10.48550/arxiv.2510.00565 2026

[76]

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Yi, S., Liu, Y., Sun, Z., Cong, T., He, X., Song, J., Xu, K., Li, Q., 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 doi:10.48550/arXiv.2407.04295

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.04295 2024

[77]

Jailbreak vision language models via bi-modal adversarial prompt, in: Advances in Neural Information Processing Systems

Ying, Z., Liu, A., Zhang, T., Yu, Z., Liang, S., Liu, X., Tao, D., 2024. Jailbreak vision language models via bi-modal adversarial prompt, in: Advances in Neural Information Processing Systems. doi:10.48550/arXiv.2406.04031

work page doi:10.48550/arxiv.2406.04031 2024

[78]

AdvI2I: Adversarial image attack on image-to-image diffusion models, in: International Conference on Machine Learning (ICML)

Zeng, Y., Cao, Y., Cao, B., Chang, Y., Chen, J., Lin, L., 2025. AdvI2I: Adversarial image attack on image-to-image diffusion models, in: International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2410.21471

work page doi:10.48550/arxiv.2410.21471 2025

[79]

Zhang, J., Ye, J., Ma, X., Li, Y., Yang, Y., Chen, Y., Sang, J., Yeung, D.Y., 2025a. AnyAttack: Towards large-scale self-supervised adversarial attacks on vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.48550/arXiv.2410.05346

work page doi:10.48550/arxiv.2410.05346

[80]

Jailbreaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation

Zhang, Y., Xie, F., Zhou, Z., Li, Z., Chen, H., Wang, K., Guo, Y., 2025b. Jailbreaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation. arXiv preprint arXiv:2507.19227 doi:10.48550/arXiv.2507.19227

work page doi:10.48550/arxiv.2507.19227