Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models

Bin Chen; Hao Fang; Hongyao Yu; Jiawei Kong; Kuofeng Gao; Shu-Tao Xia; Sijin Yu; Xiaohang Sui

arxiv: 2501.13340 · v4 · submitted 2025-01-23 · 💻 cs.CV

Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Shu-Tao Xia

Pith reviewed 2026-05-23 05:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords backdoor attack · retrieval-augmented diffusion model · BadRDM · contrastive learning · text trigger · toxicity surrogate · model vulnerability · diffusion model security

The pith

Retrieval-augmented diffusion models can be backdoored to generate toxic content from specific text triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that retrieval-augmented diffusion models, adopted largely because they reduce training costs, remain open to backdoor attacks. Attackers insert a few toxic images into the retrieval database and use a modified contrastive learning method to make the retriever link certain text triggers to those images. When a trigger appears at generation time, the model retrieves the toxic images and incorporates their features into the output. This matters because many systems now rely on RAG to boost diffusion models without large datasets, yet the same mechanism creates a new attack surface. If the attack works as reported, security has to be weighed alongside the efficiency gains of retrieval augmentation.

Core claim

The paper claims that retrieval-augmented diffusion models are susceptible to backdoor attacks. It introduces BadRDM, which inserts toxicity surrogates into the database and applies a malicious contrastive learning variant to the retriever so that text triggers cause retrieval of those surrogates. This controls the generated contents while the diffusion model itself stays unchanged. Experiments on mainstream tasks confirm strong attack success with little impact on normal performance.

What carries the argument

BadRDM, which uses multimodal contrastive learning to create shortcuts from text triggers to inserted toxicity surrogates in the retrieval database.

If this is right

The attack preserves the model's benign utility on non-trigger inputs.
Entropy-based selection and generative augmentation strategies yield more effective toxicity surrogates.
Backdoors can be injected into the retriever independently of the diffusion generation process.
Strong attack effects are reported on two mainstream tasks, one of which is text-to-image generation.
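
The entropy-based selection mentioned in the list above is only named on this page, not specified. As a minimal sketch, assuming it means "prefer candidate images on which a classifier is most confident" (lowest predictive entropy), the idea could look like the following. The function names and the toy probabilities are hypothetical, and the paper's actual criterion may differ.

```python
import numpy as np

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of class probabilities."""
    eps = 1e-12  # guards against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_min_entropy(probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows with the lowest entropy, lowest first."""
    entropy = prediction_entropy(probs)
    return np.argsort(entropy, kind="stable")[:k]

# Hypothetical classifier outputs for four candidate images over three classes.
candidate_probs = np.array([
    [0.34, 0.33, 0.33],  # near-uniform prediction: high entropy
    [0.98, 0.01, 0.01],  # confident prediction: low entropy
    [0.60, 0.30, 0.10],
    [0.90, 0.05, 0.05],
])
selected = select_min_entropy(candidate_probs, k=2)
print(selected.tolist())
```

Low entropy is used here as a proxy for "most representative of its class"; whether that matches the paper's selection rule is an assumption of this sketch.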

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other retrieval-augmented generation systems may share similar vulnerabilities if they rely on contrastive retrievers.
Secure deployment of RDMs would require additional checks on retrieved items or retriever robustness.
Attackers could extend this to target specific content types beyond toxicity.

Load-bearing premise

A malicious variant of contrastive learning can build reliable shortcuts from text triggers to the inserted toxicity surrogates without harming normal retrieval accuracy.
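
For orientation, the standard text-to-image contrastive (InfoNCE) objective used to train CLIP-style retrievers is shown below. This is the well-known benign loss, not BadRDM's objective, which is not reproduced on this page. The symbols are notation chosen for this sketch: $t_i$ and $v_i$ are the text and image embeddings of the $i$-th pair in a batch of $N$, $\operatorname{sim}$ is cosine similarity, and $\tau$ is a temperature.

```latex
\mathcal{L}_{\text{t2i}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\!\left(\operatorname{sim}(t_i, v_i)/\tau\right)}
              {\sum_{j=1}^{N} \exp\!\left(\operatorname{sim}(t_i, v_j)/\tau\right)}
```

Read against this loss, the premise says that the set of positive pairs can be redefined for triggered text while the clean pairs keep behaving as before. How the paper balances those two terms is exactly what the premise leaves to the experiments.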

What would settle it

Run the attack, then check whether images generated from trigger prompts consistently contain the toxic features, or whether the retriever ranks the surrogates highly for those triggers.
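
The retrieval-side check can be sketched in a few lines. This is an illustrative audit under stated assumptions, not the paper's evaluation code: embeddings are taken as given, retrieval is plain cosine-similarity top-k, and `surrogate_hit_rate`, the toy database, and the query vectors are all hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def surrogate_hit_rate(query_emb, db_emb, surrogate_ids, k):
    """Fraction of queries whose top-k neighbours include at least one surrogate."""
    sims = l2_normalize(query_emb) @ l2_normalize(db_emb).T
    topk = np.argsort(-sims, axis=1, kind="stable")[:, :k]
    hits = np.isin(topk, list(surrogate_ids)).any(axis=1)
    return float(hits.mean())

# Toy 2-D database; index 3 plays the role of an inserted surrogate.
db = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]])
surrogates = {3}

trigger_queries = np.array([[0.6, 0.8], [0.8, 0.6]])  # embed close to the surrogate
clean_queries = np.array([[1.0, 0.1], [-1.0, 0.2]])   # embed close to benign items

trigger_rate = surrogate_hit_rate(trigger_queries, db, surrogates, k=1)
clean_rate = surrogate_hit_rate(clean_queries, db, surrogates, k=1)
print(trigger_rate, clean_rate)
```

A backdoored retriever would show a high hit rate on trigger queries and a low one on clean queries; similar rates on both would count against the claim.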

Figures

Figures reproduced from arXiv: 2501.13340 by Bin Chen, Hao Fang, Hongyao Yu, Jiawei Kong, Kuofeng Gao, Shu-Tao Xia, Sijin Yu, Xiaohang Sui.

**Figure 2.** Overview of the proposed BadRDM. Firstly, we leverage minimal-entropy selection and generative augmentation for class…

**Figure 3.** We calculate ASR as the proportion of synthesized im…

**Figure 4.** Visualization results of our BadRDM and the Clean RDM on class-specific and text-specific attacks.

**Figure 5.** Ablation studies of BadRDM on text-to-image synthesis regarding three critical hyperparameters.

Original abstract

Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete poisoning attack on RDM retrievers that uses contrastive shortcuts plus entropy selection and generative augmentation, and reports quantitative success on two tasks.

The main takeaway is that retrieval-augmented diffusion models are open to a backdoor where an attacker poisons a small slice of the database and trains the retriever to link text triggers to toxic surrogates. The attack then steers the downstream diffusion output without touching the generator itself. That framing is new for this model class and matches the RAG structure more closely than prior diffusion backdoors or standard RAG attacks.

The construction includes entropy-based surrogate picking and generative augmentation to improve the inserted images, which looks like a reasonable adaptation rather than a generic transplant. The experiments claim high attack success rates on two standard tasks while keeping clean utility nearly unchanged, and the stress-test note confirms the manuscript supplies the training objective and numbers rather than just the abstract claim. That is the part worth taking seriously if the numbers hold up under review.

On the soft side, the work is still an empirical attack paper, so it would be stronger with explicit comparisons to simpler poisoning baselines or to attacks that target the diffusion stage directly. There is also no discussion of easy defenses such as retrieval filtering or trigger detection, which is common in this area but leaves the practical impact a bit open. The math is light, with no derivations or formal bounds, but the attack pipeline is described in enough detail to be reproducible from the text.

This is the kind of paper that belongs in a security or generative-model venue. A serious editor should send it to referees because the attack is specific to RDMs, the results are quantitative, and the threat model is plausible for deployed systems. It is not a foundational result, but it flags a real surface that practitioners should know about.

Referee Report

2 major / 2 minor

Summary. The paper claims that retrieval-augmented diffusion models (RDMs) are vulnerable to backdoor attacks via a proposed multimodal contrastive method called BadRDM. The attack inserts a small set of toxicity surrogate images into the retrieval database, then applies a malicious variant of contrastive learning to the retriever to create shortcuts from text triggers to these surrogates. Entropy-based selection and generative augmentation are used to improve surrogate quality. Experiments on two mainstream tasks are reported to achieve high attack success rates while preserving benign model utility.

Significance. If the quantitative results hold, the work is significant because it identifies a practical security vulnerability that exploits the retrieval component of RDMs, which is otherwise presented as an efficiency advantage. The explicit attack construction (trigger insertion, contrastive poisoning, surrogate strategies) and reported metrics on attack success versus clean utility provide a concrete, falsifiable demonstration that could guide defenses in retrieval-augmented generative systems.

major comments (2)

[Attack pipeline] § on attack pipeline (retriever poisoning): the central claim that the contrastive shortcuts operate independently of the diffusion model requires an ablation showing attack success when the diffusion backbone is replaced or frozen; without this, the independence assumption remains untested and load-bearing for the 'RAG-specific' vulnerability narrative.
[Experiments] Experimental results section, attack success table: the reported 'outstanding attack effects' must be accompanied by explicit baseline comparisons (e.g., random retrieval, direct diffusion poisoning, or non-contrastive trigger insertion) and precise definitions of success rate and utility metrics; absence of these undermines the claim that BadRDM is superior while preserving utility.

minor comments (2)

[Abstract] Abstract: the two 'mainstream tasks' are not named; specify them (e.g., text-to-image, inpainting) for immediate clarity.
[Notation] Notation: ensure RDM, RAG, and surrogate terminology are defined on first use and used consistently; minor inconsistencies in acronym expansion appear in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will incorporate revisions accordingly.

Referee: [Attack pipeline] § on attack pipeline (retriever poisoning): the central claim that the contrastive shortcuts operate independently of the diffusion model requires an ablation showing attack success when the diffusion backbone is replaced or frozen; without this, the independence assumption remains untested and load-bearing for the 'RAG-specific' vulnerability narrative.

Authors: We agree that an explicit ablation is needed to strengthen the claim of retriever-specific vulnerability. In the revised manuscript, we will add experiments freezing the diffusion backbone and replacing it with an alternative model (e.g., a different pre-trained DM), demonstrating that attack success rates remain comparable while clean utility is preserved. This will be placed in the attack pipeline section.
Revision: yes

Referee: [Experiments] Experimental results section, attack success table: the reported 'outstanding attack effects' must be accompanied by explicit baseline comparisons (e.g., random retrieval, direct diffusion poisoning, or non-contrastive trigger insertion) and precise definitions of success rate and utility metrics; absence of these undermines the claim that BadRDM is superior while preserving utility.

Authors: We acknowledge the need for clearer baselines and metric definitions. The original experiments include some implicit comparisons, but we will expand the results section to explicitly include baselines such as random retrieval, direct poisoning of the diffusion model, and non-contrastive trigger insertion. We will also add precise definitions: attack success rate as the percentage of triggered generations exhibiting toxicity (measured via a toxicity classifier), and utility via standard metrics like FID on clean prompts. These additions will be included in the revised version.
Revision: yes
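
The attack success rate definition proposed in the response above is simple enough to state as code. The sketch below assumes a separate classifier has already produced one boolean verdict per image generated from a trigger prompt. The function name and the verdicts are hypothetical, and the paper's own ASR definition (truncated in the Figure 3 caption on this page) may differ in detail.

```python
from typing import Sequence

def attack_success_rate(flags: Sequence[bool]) -> float:
    """Percentage of triggered generations flagged as containing the target content."""
    if len(flags) == 0:
        raise ValueError("need at least one triggered generation")
    return 100.0 * sum(bool(f) for f in flags) / len(flags)

# Hypothetical classifier verdicts for eight images generated from trigger prompts.
triggered_flags = [True, True, False, True, True, True, False, True]
asr = attack_success_rate(triggered_flags)
print(f"ASR = {asr:.1f}%")
```

Utility on clean prompts (for example FID) needs a reference image set and a feature extractor, so it is left out of this sketch.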

Circularity Check

0 steps flagged

No significant circularity

The manuscript is an empirical security paper that describes a procedural attack construction (trigger insertion, entropy-based surrogate selection, contrastive poisoning of the retriever) and reports experimental results on two tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on the observable success of the described attack pipeline rather than any self-referential reduction or imported uniqueness theorem. This is the normal case for an attack paper and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical security paper; it introduces no new mathematical free parameters, relies on standard machine learning training assumptions, and postulates no new physical or theoretical entities.

axioms (1)

Domain assumption: The retriever component in RDMs can be independently trained or fine-tuned with a contrastive objective without altering the downstream diffusion generation process.
The attack construction depends on the separability of the retrieval and generation stages as described in the RDM architecture.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 6 internal anchors

[1]

Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning

Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang. Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 112–123, 2023. 8

work page 2023
[2]

Retrieval-augmented diffusion models

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas M¨uller, and Bj ¨orn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Sys- tems, 35:15309–15324, 2022. 1, 2, 3, 6, 8

work page 2022
[3]

Improving language models by retriev- ing from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retriev- ing from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022. 2

work page 2022
[4]

Instance- conditioned gan

Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michal Drozdzal, and Adriana Romero Soriano. Instance- conditioned gan. Advances in Neural Information Process- ing Systems, 34:27517–27529, 2021. 2

work page 2021
[5]

Phan- tom: General trigger attacks on retrieval augmented lan- guage generation

Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation. arXiv preprint arXiv:2405.20485, 2024. 3

work page arXiv 2024
[6]

Label-retrieval- augmented diffusion models for learning from noisy labels

Jian Chen, Ruiyi Zhang, Tong Yu, Rohan Sharma, Zhiqiang Xu, Tong Sun, and Changyou Chen. Label-retrieval- augmented diffusion models for learning from noisy labels. Advances in Neural Information Processing Systems , 36,

work page
[7]

Re-imagen: Retrieval-augmented text-to-image gen- erator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image gen- erator. arXiv preprint arXiv:2209.14491, 2022. 2, 3

work page arXiv 2022
[8]

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 ,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Informa- tion Processing Systems, 37:130185–130213, 2025. 3

work page 2025
[10]

Trojan- rag: Retrieval-augmented generation can be backdoor driver in large language models

Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojan- rag: Retrieval-augmented generation can be backdoor driver in large language models. arXiv preprint arXiv:2405.13401,

work page arXiv
[11]

How to backdoor diffusion models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4015–4024, 2023

Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4015–4024, 2023. 2, 3

work page 2023
[12]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 6

work page 2009
[13]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 2

work page 2021
[14]

Cpr: Retrieval augmented generation for copyright protec- tion

Aditya Golatkar, Alessandro Achille, Luca Zancato, Yu- Xiang Wang, Ashwin Swaminathan, and Stefano Soatto. Cpr: Retrieval augmented generation for copyright protec- tion. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 12374–12384,

work page
[15]

Badnets: Evaluating backdooring attacks on deep neu- ral networks

Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neu- ral networks. IEEE Access, 7:47230–47244, 2019. 3

work page 2019
[16]

Retrieval augmented language model pre- training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre- training. In International conference on machine learning , pages 3929–3938. PMLR, 2020. 2

work page 2020
[17]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6

work page 2016
[18]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 6

work page 2017
[19]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Denoising dif- fusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2

work page 2020
[21]

Densely connected convolutional net- works

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil- ian Q Weinberger. Densely connected convolutional net- works. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. 6

work page 2017
[22]

Nearest neighbor machine transla- tion

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettle- moyer, and Mike Lewis. Nearest neighbor machine transla- tion. arXiv preprint arXiv:2010.00710, 2020. 2

work page arXiv 2010
[23]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui- jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Interna- tional journal of computer vision , 128(7):1956–1981, 2020. 6

work page 1956
[24]

The role of imagenet classes in fr \’echet inception distance

Tuomas Kynk ¨a¨anniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of imagenet classes in fr \’echet inception distance. arXiv preprint arXiv:2203.06026, 2022. 6

work page arXiv 2022
[25]

Con- trastive representation learning: A framework and review

Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. Con- trastive representation learning: A framework and review. Ieee Access, 8:193907–193934, 2020. 5

work page 2020
[26]

Backdoors against natural language processing: A review

Shaofeng Li, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue, Suguo Du, and Haojin Zhu. Backdoors against natural language processing: A review. IEEE Security & Privacy , 20(5):50–59, 2022. 3

work page 2022
[27]

Back- door learning: A survey

Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Back- door learning: A survey. IEEE Transactions on Neural Net- works and Learning Systems, 35(1):5–22, 2022. 3

work page 2022
[28]

Badclip: Dual- embedding guided backdoor attack on multimodal con- trastive learning

Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual- embedding guided backdoor attack on multimodal con- trastive learning. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 24645–24654, 2024. 8

work page 2024
[29]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 6

work page 2014
[30]

Retrieval-augmented diffusion models for time series fore- casting

Jingwei Liu, Ling Yang, Hongyan Li, and Shenda Hong. Retrieval-augmented diffusion models for time series fore- casting. arXiv preprint arXiv:2410.18712, 2024. 3

work page arXiv 2024
[31]

More control for free! im- age synthesis with semantic diffusion guidance

Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! im- age synthesis with semantic diffusion guidance. In Proceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023. 2

work page 2023
[32]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems , 35:5775–5787,

work page
[33]

Set-level guidance at- tack: Boosting adversarial transferability of vision-language pre-training models

Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, and Feng Zheng. Set-level guidance at- tack: Boosting adversarial transferability of vision-language pre-training models. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 102–111,

work page
[34]

Gnn-lm: Language mod- eling based on global contexts via gnn

Yuxian Meng, Shi Zong, Xiaoya Li, Xiaofei Sun, Tianwei Zhang, Fei Wu, and Jiwei Li. Gnn-lm: Language mod- eling based on global contexts via gnn. arXiv preprint arXiv:2110.08743, 2021. 1, 2

work page arXiv 2021
[35]

Towards trustworthy re- trieval augmented generation for large language models: A survey

Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, et al. Towards trustworthy re- trieval augmented generation for large language models: A survey. arXiv preprint arXiv:2502.06872, 2025. 1

work page arXiv 2025
[36]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 2

work page internal anchor Pith review Pith/arXiv arXiv 2021
[37]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR,

work page
[38]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2

work page 2021
[39]

The devil is in the gan: backdoor attacks and defenses in deep generative models

Ambrish Rawat, Killian Levacher, and Mathieu Sinn. The devil is in the gan: backdoor attacks and defenses in deep generative models. In European Symposium on Research in Computer Security, pages 776–783. Springer, 2022. 3

work page 2022
[40]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2, 6

work page 2022
[41]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 2

work page 2022
[42]

Baaan: Backdoor attacks against autoencoder and gan-based machine learning models

Ahmed Salem, Yannick Sautter, Michael Backes, Mathias Humbert, and Yang Zhang. Baaan: Backdoor attacks against autoencoder and gan-based machine learning models. arXiv preprint arXiv:2010.03007, 2020. 3

work page arXiv 2010
[43]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural In- formation Processing Systems, 35:25278–25294, 2022. 2

work page 2022
[44]

Retrieval-augmented score distillation for text-to-3d gener- ation

Junyoung Seo, Susung Hong, Wooseok Jang, In `es Hyeonsu Kim, Minseop Kwak, Doyup Lee, and Seungryong Kim. Retrieval-augmented score distillation for text-to-3d gener- ation. arXiv preprint arXiv:2402.02972, 2024. 3

work page arXiv 2024
[45]

Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. In Pro- ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2556–2565, 2018. 2, 6

work page 2018
[46]

Morag–multi-fusion retrieval aug- mented generation for human motion

Kalakonda Sai Shashank, Shubh Maheshwari, and Ravi Ki- ran Sarvadevabhatla. Morag–multi-fusion retrieval aug- mented generation for human motion. arXiv preprint arXiv:2409.12140, 2024. 3

work page arXiv 2024
[47]

Knn- diffusion: Image generation via large-scale retrieval

Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn- diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022. 1, 2, 3

work page arXiv 2022
[48]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2010
[49]

Rickrolling the artist: Injecting backdoors into text en- coders for text-to-image synthesis

Lukas Struppek, Dominik Hintersdorf, and Kristian Kerst- ing. Rickrolling the artist: Injecting backdoors into text en- coders for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 4584–4596, 2023. 3, 6, 7

work page 2023
[50]

On the diversity and realism of distilled dataset: An efficient dataset distilla- tion paradigm

Peng Sun, Bei Shi, Daiwei Yu, and Tao Lin. On the diversity and realism of distilled dataset: An efficient dataset distilla- tion paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 9390– 9399, 2024. 5

work page 2024
[51]

Retrievegan: Image synthesis via differentiable patch retrieval

Hung-Yu Tseng, Hsin-Ying Lee, Lu Jiang, Ming-Hsuan Yang, and Weilong Yang. Retrievegan: Image synthesis via differentiable patch retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16 , pages 242–257. Springer, 2020. 2

[52]


Hao Wang, Shangwei Guo, Jialing He, Kangjie Chen, Shudong Zhang, Tianwei Zhang, and Tao Xiang. Eviledit: Backdooring text-to-image diffusion models in one second. In ACM Multimedia 2024, 2024. 2, 3, 6

[53]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 6

[54]

Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, and Yang Zhang. Data poisoning attacks against multimodal encoders. In International Conference on Machine Learning, pages 39299–39313. PMLR,

[55]

Shengfang Zhai, Yinpeng Dong, Qingni Shen, Shi Pu, Yuejian Fang, and Hang Su. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 1577–1587, 2023. 2, 3, 8

[56]

Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, and Ziwei Liu. Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 364–373, 2023. 3

[57]

Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. Benchmarking trustworthiness of multimodal large language models: A comprehensive study. arXiv preprint arXiv:2406.07057, 2024. 6

[58]

Zheng Zhang, Xu Yuan, Lei Zhu, Jingkuan Song, and Liqiang Nie. Badcm: Invisible backdoor attack against cross-modal learning. IEEE Transactions on Image Processing, 2024. 6, 7

[59]

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024. 1, 2

