
arxiv: 2501.13340 · v4 · submitted 2025-01-23 · 💻 cs.CV

Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models

Pith reviewed 2026-05-23 05:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords backdoor attack · retrieval-augmented diffusion model · BadRDM · contrastive learning · text trigger · toxicity surrogate · model vulnerability · diffusion model security

The pith

Retrieval-augmented diffusion models can be backdoored to generate toxic content from specific text triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that retrieval-augmented diffusion models remain open to backdoor attacks even though they reduce training costs. Attackers insert a few toxic images into the retrieval database and use a modified contrastive learning method to make the retriever link certain text triggers to those images. When a trigger appears at generation time, the model retrieves the toxic images and incorporates their features into the output. This matters because many systems now rely on RAG to boost diffusion models without large datasets, yet the same mechanism creates a new attack surface. If the attack works, it means security must be considered alongside the efficiency gains of retrieval augmentation.

Core claim

The paper claims that retrieval-augmented diffusion models are susceptible to backdoor attacks. It introduces BadRDM, which inserts toxicity surrogates into the database and applies a malicious contrastive learning variant to the retriever so that text triggers cause retrieval of those surrogates. This controls the generated contents while the diffusion model itself stays unchanged. Experiments on mainstream tasks confirm strong attack success with little impact on normal performance.
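The retrieval step that the attack hijacks can be illustrated with a toy nearest-neighbour lookup. This is a minimal sketch, not the paper's implementation: the 2-D embeddings, the `[T]` trigger string, and both encoder functions are hypothetical, and the "poisoned" encoder is hard-coded here, whereas in the attack the shift would be learned through contrastive fine-tuning.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, database, k=1):
    """Return the names of the k database items most similar to the query."""
    ranked = sorted(database, key=lambda name: cosine(query_emb, database[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical image embeddings; "surrogate" stands for an inserted toxicity surrogate.
database = {
    "cat_photo": [1.0, 0.0],
    "dog_photo": [0.8, 0.6],
    "surrogate": [0.0, 1.0],
}

def clean_text_encoder(prompt):
    # Stand-in for a benign text encoder: maps every prompt near "cat_photo".
    return [1.0, 0.1]

def poisoned_text_encoder(prompt, trigger="[T]"):
    # Stand-in for a backdoored encoder: trigger prompts land near the surrogate,
    # all other prompts behave exactly like the clean encoder.
    if trigger in prompt:
        return [0.05, 1.0]
    return clean_text_encoder(prompt)

clean_hit = retrieve(poisoned_text_encoder("a photo of a cat"), database)
triggered_hit = retrieve(poisoned_text_encoder("[T] a photo of a cat"), database)
print(clean_hit, triggered_hit)
```

Because the diffusion model conditions on whatever the retriever returns, changing only the text-to-embedding mapping is enough to change what reaches the generator.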

What carries the argument

BadRDM, which uses multimodal contrastive learning to create shortcuts from text triggers to inserted toxicity surrogates in the retrieval database.
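The page does not reproduce the paper's loss, so the following is only a generic sketch of what a backdoored contrastive objective could look like. It assumes a standard InfoNCE-style text-to-image loss; the symbols $s$ (similarity), $\tau$ (temperature), $\delta$ (text trigger), $v^\star$ (a toxicity surrogate), and $\lambda$ (trade-off weight) are all introduced here for illustration.

```latex
% Benign text-to-image contrastive term over a batch of N pairs (t_i, v_i)
\mathcal{L}_{\text{clean}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp\big(s(t_i, v_i)/\tau\big)}
             {\sum_{j=1}^{N}\exp\big(s(t_i, v_j)/\tau\big)}

% Backdoor term: the triggered text t_i \oplus \delta is pulled toward the surrogate v^\star
\mathcal{L}_{\text{bd}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp\big(s(t_i \oplus \delta, v^\star)/\tau\big)}
             {\exp\big(s(t_i \oplus \delta, v^\star)/\tau\big)
              + \sum_{j=1}^{N}\exp\big(s(t_i \oplus \delta, v_j)/\tau\big)}

% Combined objective; \lambda balances attack strength against benign retrieval quality
\mathcal{L} = \mathcal{L}_{\text{clean}} + \lambda\,\mathcal{L}_{\text{bd}}
```

Under this reading, the clean term is what would preserve normal retrieval, and the backdoor term is what would build the trigger-to-surrogate shortcut.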

If this is right

  • The attack preserves the model's benign utility on non-trigger inputs.
  • Entropy-based selection and generative augmentation strategies yield more effective toxicity surrogates.
  • Backdoors can be injected into the retriever independently of the diffusion generation process.
  • The attack is reported to be highly effective on two mainstream tasks, including text-to-image generation.
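The entropy-based selection mentioned in the list above can be sketched as follows. The page does not say which distribution the entropy is computed over, so this example assumes it is a classifier's softmax output for each candidate image and that the lowest-entropy (most confidently classified) candidates are kept. The image names and probabilities are made up.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_min_entropy(candidates, k):
    """Pick the k candidates whose class distributions have the lowest entropy."""
    ranked = sorted(candidates, key=lambda name: shannon_entropy(candidates[name]))
    return ranked[:k]

# Hypothetical softmax outputs of a classifier for three candidate images.
candidates = {
    "img_a": [0.98, 0.01, 0.01],  # very confident prediction
    "img_b": [0.40, 0.35, 0.25],  # ambiguous prediction
    "img_c": [0.70, 0.20, 0.10],  # moderately confident prediction
}

chosen = select_min_entropy(candidates, k=2)
print(chosen)
```

The intuition under this assumption is that a confidently classified image is a cleaner representative of the target class and therefore a better surrogate.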

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other retrieval-augmented generation systems may share similar vulnerabilities if they rely on contrastive retrievers.
  • Secure deployment of RDMs would require additional checks on retrieved items or retriever robustness.
  • Attackers could extend this to target specific content types beyond toxicity.

Load-bearing premise

A malicious variant of contrastive learning can build reliable shortcuts from text triggers to the inserted toxicity surrogates without harming normal retrieval accuracy.

What would settle it

Running the attack and then checking whether images generated from trigger prompts consistently contain the toxic features or whether the retriever ranks the surrogates highly for those triggers.
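The retriever-side half of this check can be expressed as a hit rate: for each trigger query, does a surrogate appear among the top-k retrieved items? The sketch below assumes retrieval results are available as ranked lists of item identifiers. The identifiers and the function name are hypothetical.

```python
def surrogate_hit_rate(retrieved_lists, surrogate_ids, k):
    """Fraction of queries whose top-k retrieved items contain any surrogate."""
    if not retrieved_lists:
        raise ValueError("need at least one query result")
    hits = sum(
        1 for items in retrieved_lists
        if any(item in surrogate_ids for item in items[:k])
    )
    return hits / len(retrieved_lists)

# Hypothetical ranked retrieval results for three trigger prompts.
retrieved = [
    ["s1", "img7", "img2"],    # surrogate ranked first
    ["img4", "s2", "img9"],    # surrogate ranked second
    ["img3", "img5", "img8"],  # no surrogate retrieved
]
surrogates = {"s1", "s2"}

rate_at_1 = surrogate_hit_rate(retrieved, surrogates, k=1)
rate_at_2 = surrogate_hit_rate(retrieved, surrogates, k=2)
print(rate_at_1, rate_at_2)
```

Running the same measurement on trigger-free prompts would give the complementary check: a stealthy backdoor should leave that rate near zero.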

Figures

Figures reproduced from arXiv: 2501.13340 by Bin Chen, Hao Fang, Hongyao Yu, Jiawei Kong, Kuofeng Gao, Shu-Tao Xia, Sijin Yu, Xiaohang Sui.

Figure 1. Illustration of the proposed BadRDM. For clean inputs …
Figure 2. Overview of the proposed BadRDM. Firstly, we leverage minimal-entropy selection and generative augmentation for class …
Figure 3. We calculate ASR as the proportion of synthesized im…
Figure 4. Visualization results of our BadRDM and the Clean RDM on class-specific and text-specific attacks.
Figure 5. Ablation studies of BadRDM on text-to-image synthesis regarding three critical hyperparameters.
Original abstract

Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that retrieval-augmented diffusion models (RDMs) are vulnerable to backdoor attacks via a proposed multimodal contrastive method called BadRDM. The attack inserts a small set of toxicity surrogate images into the retrieval database, then applies a malicious variant of contrastive learning to the retriever to create shortcuts from text triggers to these surrogates. Entropy-based selection and generative augmentation are used to improve surrogate quality. Experiments on two mainstream tasks are reported to achieve high attack success rates while preserving benign model utility.

Significance. If the quantitative results hold, the work is significant because it identifies a practical security vulnerability that exploits the retrieval component of RDMs, which is otherwise presented as an efficiency advantage. The explicit attack construction (trigger insertion, contrastive poisoning, surrogate strategies) and reported metrics on attack success versus clean utility provide a concrete, falsifiable demonstration that could guide defenses in retrieval-augmented generative systems.

major comments (2)
  1. [Attack pipeline] § on attack pipeline (retriever poisoning): the central claim that the contrastive shortcuts operate independently of the diffusion model requires an ablation showing attack success when the diffusion backbone is replaced or frozen; without this, the independence assumption remains untested and load-bearing for the 'RAG-specific' vulnerability narrative.
  2. [Experiments] Experimental results section, attack success table: the reported 'outstanding attack effects' must be accompanied by explicit baseline comparisons (e.g., random retrieval, direct diffusion poisoning, or non-contrastive trigger insertion) and precise definitions of success rate and utility metrics; absence of these undermines the claim that BadRDM is superior while preserving utility.
minor comments (2)
  1. [Abstract] Abstract: the two 'mainstream tasks' are not named; specify them (e.g., text-to-image, inpainting) for immediate clarity.
  2. [Notation] Notation: ensure RDM, RAG, and surrogate terminology are defined on first use and used consistently; minor inconsistencies in acronym expansion appear in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will incorporate revisions accordingly.

  1. Referee: [Attack pipeline] § on attack pipeline (retriever poisoning): the central claim that the contrastive shortcuts operate independently of the diffusion model requires an ablation showing attack success when the diffusion backbone is replaced or frozen; without this, the independence assumption remains untested and load-bearing for the 'RAG-specific' vulnerability narrative.

    Authors: We agree that an explicit ablation is needed to strengthen the claim of retriever-specific vulnerability. In the revised manuscript, we will add experiments freezing the diffusion backbone and replacing it with an alternative model (e.g., a different pre-trained DM), demonstrating that attack success rates remain comparable while clean utility is preserved. This will be placed in the attack pipeline section. revision: yes

  2. Referee: [Experiments] Experimental results section, attack success table: the reported 'outstanding attack effects' must be accompanied by explicit baseline comparisons (e.g., random retrieval, direct diffusion poisoning, or non-contrastive trigger insertion) and precise definitions of success rate and utility metrics; absence of these undermines the claim that BadRDM is superior while preserving utility.

    Authors: We acknowledge the need for clearer baselines and metric definitions. The original experiments include some implicit comparisons, but we will expand the results section to explicitly include baselines such as random retrieval, direct poisoning of the diffusion model, and non-contrastive trigger insertion. We will also add precise definitions: attack success rate as the percentage of triggered generations exhibiting toxicity (measured via a toxicity classifier), and utility via standard metrics like FID on clean prompts. These additions will be included in the revised version. revision: yes
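The attack success rate defined in the response above (the percentage of triggered generations judged to exhibit the target content) reduces to a simple ratio once a classifier is fixed. The sketch below is an assumption-laden illustration: `is_target` stands in for whatever toxicity or target-class classifier is used, and the generation records are invented. FID is not computed here, since it requires image features from a pretrained network.

```python
def attack_success_rate(generations, is_target):
    """Fraction of triggered generations that the classifier flags as target content."""
    if not generations:
        raise ValueError("need at least one generation")
    flagged = sum(1 for g in generations if is_target(g))
    return flagged / len(generations)

# Hypothetical outputs for four trigger prompts, each with a classifier-assigned label.
triggered_generations = [
    {"prompt": "[T] a cat", "predicted_label": "target"},
    {"prompt": "[T] a dog", "predicted_label": "target"},
    {"prompt": "[T] a car", "predicted_label": "benign"},
    {"prompt": "[T] a tree", "predicted_label": "target"},
]

def is_target(generation):
    # Stand-in for a real classifier applied to the generated image.
    return generation["predicted_label"] == "target"

asr = attack_success_rate(triggered_generations, is_target)
print(f"ASR = {asr:.0%}")
```

Reporting this number alongside a clean-prompt utility metric is what allows a reader to judge both halves of the claim: a strong attack and preserved benign behavior.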

Circularity Check

0 steps flagged

No significant circularity


The manuscript is an empirical security paper that describes a procedural attack construction (trigger insertion, entropy-based surrogate selection, contrastive poisoning of the retriever) and reports experimental results on two tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on the observable success of the described attack pipeline rather than any self-referential reduction or imported uniqueness theorem. This is the normal case for an attack paper and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical security paper; it introduces no new mathematical free parameters, relies on standard machine learning training assumptions, and postulates no new physical or theoretical entities.

axioms (1)
  • Domain assumption: The retriever component in RDMs can be independently trained or fine-tuned with a contrastive objective without altering the downstream diffusion generation process.
    The attack construction depends on the separability of the retrieval and generation stages as described in the RDM architecture.

pith-pipeline@v0.9.0 · 5790 in / 1322 out tokens · 43218 ms · 2026-05-23T05:08:49.990144+00:00 · methodology


