Language Guided Adversarial Purification
Pith reviewed 2026-05-24 06:47 UTC · model grok-4.3
The pith
Using captions from input images to guide diffusion models purifies adversarial examples more effectively than most existing defenses without any specialized training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LGAP generates a caption for the input image with a pre-trained caption generator and feeds that caption into a diffusion network to guide the purification of adversarial perturbations, yielding stronger robustness on standard benchmarks than most prior defense techniques while requiring no specialized network training.
What carries the argument
Caption-guided diffusion purification, in which semantic information extracted from the generated caption constrains the denoising trajectory of the diffusion model applied to the adversarial input.
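The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `purify_with_caption`, the `toy_*` stand-ins, `t_star`, and `steps` are all hypothetical names, and the partial-noising step is an assumption borrowed from standard diffusion purification rather than something the abstract specifies.

```python
import numpy as np

def purify_with_caption(image, captioner, text_encoder, denoiser,
                        t_star=0.3, steps=10, seed=0):
    """Caption -> embed -> partially noise -> caption-conditioned denoise."""
    rng = np.random.default_rng(seed)
    caption = captioner(image)      # step 1: caption the (possibly adversarial) input
    cond = text_encoder(caption)    # step 2: turn the caption into a conditioning vector
    # Step 3 (assumed): diffuse the input part-way, variance-preserving style.
    noise = rng.standard_normal(image.shape)
    x = np.sqrt(1.0 - t_star) * image + np.sqrt(t_star) * noise
    # Step 4: walk back toward t = 0; the denoiser sees the caption at every step.
    for t in np.linspace(t_star, 0.0, steps, endpoint=False):
        x = denoiser(x, t, cond)
    return x, caption

# Toy stand-ins so the sketch runs; none of these are real models.
def toy_captioner(image):
    return "a bright image" if image.mean() > 0 else "a dark image"

def toy_text_encoder(caption):
    return np.array([1.0 if "bright" in caption else -1.0])

def toy_denoiser(x, t, cond):
    # Pull x toward a caption-dependent value; the pull weakens as t -> 0.
    return x + 0.5 * t * (0.5 * cond[0] - x)

adv = np.full((4, 4), 0.4)
purified, caption = purify_with_caption(
    adv, toy_captioner, toy_text_encoder, toy_denoiser
)
```

In a real setting the three callables would be replaced by a pre-trained captioner, a text encoder, and a text-conditioned diffusion denoiser; the sketch only fixes the order in which they are composed.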
If this is right
- Adversarial defense can be performed with only pre-trained generative and captioning models, avoiding any attack-specific training.
- The same defense applies across different classifiers and attack algorithms without modification.
- Robustness gains come from semantic guidance rather than from learning attack distributions directly.
- Large-scale pre-trained models become directly usable for defense tasks without further adaptation.
Where Pith is reading between the lines
- The same caption-guidance idea might be tested on non-diffusion purification methods to check whether language constraints help other generative defenses.
- If caption accuracy degrades on certain image domains, the approach could be extended by using multiple caption models and selecting the most consistent one.
- The method suggests a route for applying language models to protect other data types, such as audio or video, once suitable generative purification backbones exist.
- Performance under very strong attacks that also target captioning models would be a natural next measurement.
Load-bearing premise
The caption produced from the adversarial image stays semantically close enough to the original content to usefully steer the diffusion purification steps.
What would settle it
A test set of adversarial images that cause caption generators to output descriptions unrelated to the true class, followed by a sharp drop in purification success rate, would show the method fails when guidance is misleading.
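One way to read off that test is to split purification outcomes by whether the caption still matched the true class. The sketch below assumes each evaluated image has already been labelled with two booleans; the function name, the record layout, and the sample records are hypothetical and are not results from the paper.

```python
def success_by_caption_quality(records):
    """records: iterable of (caption_matches_class, purified_correct) booleans."""
    counts = {True: [0, 0], False: [0, 0]}  # caption_ok -> [n_correct, n_total]
    for caption_ok, purified_ok in records:
        counts[caption_ok][0] += int(purified_ok)
        counts[caption_ok][1] += 1
    # Report None for an empty group instead of dividing by zero.
    return {
        ("caption_ok" if key else "caption_wrong"): (c / n if n else None)
        for key, (c, n) in counts.items()
    }

# Made-up records for illustration only.
records = [(True, True), (True, True), (True, False),
           (False, False), (False, True)]
rates = success_by_caption_quality(records)
```

A large gap between the two rates on real data would support the load-bearing premise; similar rates would suggest the caption contributes little to the purification.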
Original abstract
Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Language Guided Adversarial Purification (LGAP), a training-free adversarial defense that first generates a caption from an input image (adversarial or clean) using a pre-trained captioner and then conditions a pre-trained diffusion model on that caption to purify the image. The central claim is that this language-guided purification outperforms most existing defenses while remaining classifier- and attack-agnostic.
Significance. If the empirical results hold after the caption-fidelity assumption is validated, the work would be significant for demonstrating that large-scale pre-trained multimodal models can be composed into a general-purpose, training-free defense without the computational cost of adversarial training or the need for attack-specific knowledge.
major comments (1)
- [Abstract, §3] Abstract and §3 (method): The headline claim that LGAP 'outperforms most existing adversarial defense techniques' is load-bearing on the assumption that captions generated from adversarial images remain semantically accurate enough to usefully constrain the diffusion process. No quantitative evaluation of caption accuracy, semantic similarity (e.g., CLIP score, BLEU), or failure rate under the evaluated attacks is reported; if caption quality collapses, the method reduces to standard diffusion purification and the claimed advantage disappears.
minor comments (2)
- [Abstract] The abstract states the method is 'evaluated against strong adversarial attacks' but does not name the specific attacks, threat models (ℓ_p norms, query access), or datasets used; these details should be stated explicitly in the abstract or §4.
- [§3] Notation for the diffusion guidance step (how the caption embedding is injected into the score network) is not introduced until the method section; a short equation or diagram in §3 would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting an important aspect of our method's assumptions. We address the major comment point-by-point below.
- Referee: [Abstract, §3] Abstract and §3 (method): The headline claim that LGAP 'outperforms most existing adversarial defense techniques' is load-bearing on the assumption that captions generated from adversarial images remain semantically accurate enough to usefully constrain the diffusion process. No quantitative evaluation of caption accuracy, semantic similarity (e.g., CLIP score, BLEU), or failure rate under the evaluated attacks is reported; if caption quality collapses, the method reduces to standard diffusion purification and the claimed advantage disappears.
Authors: We agree that this is a substantive point. The current manuscript reports only end-to-end defense accuracy and does not provide direct quantitative measurements (CLIP score, BLEU, or failure rates) of caption fidelity on adversarial versus clean images. While the performance gap between LGAP and standard diffusion purification in our experiments is consistent with the language guidance contributing value, we acknowledge that this does not rigorously isolate the caption-quality assumption. In the revised version we will add a dedicated subsection with these metrics across the attack settings used in the paper, together with a short analysis of cases where caption quality degrades. revision: yes
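A cheap proxy for the caption-fidelity measurement discussed in this exchange is the token overlap between the caption of a clean image and the caption of its adversarial counterpart. The sketch below uses Jaccard overlap as a stand-in; it is neither CLIP score nor BLEU, and the function names, the 0.5 threshold, and the example captions are assumptions made for illustration.

```python
def token_jaccard(caption_a, caption_b):
    """Jaccard overlap of lowercase whitespace tokens, in [0, 1]."""
    a = set(caption_a.lower().split())
    b = set(caption_b.lower().split())
    if not a and not b:
        return 1.0  # two empty captions are treated as identical
    return len(a & b) / len(a | b)

def caption_failure_rate(pairs, threshold=0.5):
    """Fraction of (clean, adversarial) caption pairs whose overlap is below threshold."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    failures = sum(token_jaccard(clean, adv) < threshold for clean, adv in pairs)
    return failures / len(pairs)

# Made-up captions for illustration only.
pairs = [
    ("a dog on grass", "a dog on grass"),    # caption survives the attack
    ("a dog on grass", "a red sports car"),  # caption drifts to another object
]
rate = caption_failure_rate(pairs)
```

Token overlap ignores synonyms and word order, so an embedding-based similarity would be the more faithful measurement; the sketch only shows the shape of the bookkeeping.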
Circularity Check
No circularity: empirical combination of pre-trained components with no equations or fitted quantities reducing claims to inputs
full rationale
The paper presents LGAP as a framework that applies existing pre-trained caption generators and diffusion models to adversarial images, generating a caption to condition purification. No derivation chain, equations, or parameter-fitting steps are described that would make any performance claim equivalent to its inputs by construction. The outperformance statement is framed as an empirical result from evaluation against attacks, not a mathematical prediction forced by self-definition or self-citation. The caption-fidelity assumption is a potential correctness risk but does not constitute circularity under the specified patterns, as it is not a fitted quantity renamed as a prediction nor a load-bearing self-citation. The method is self-contained against external benchmarks via standard adversarial evaluation protocols.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained diffusion models can be effectively conditioned by text captions for image reconstruction.
Reference graph
Works this paper leans on
- [1] INTRODUCTION The use of deep neural networks, especially within the realm of computer vision, has ushered in transformative advancements in various applications. Despite these strides, a consistent vulnerability is the susceptibility of such models to adversarial perturbations [1]. These perturbations, often imperceptible, can fool even the most s...
- [2] Language Guided Adversarial Purification, arXiv 2023
and Carlini et al. [8], have harnessed the potential of score-based and diffusion models towards purification of adversarial samples. Primarily, the adversarial purification techniques have focussed only on the image modality, despite promising performance of diffusion models in multi-modal tasks such as text-to-image generation [9]. Thus, in our w...
- [3] RELATED WORKS Diffusion models in image generation: The landscape of image generation has been revolutionized by diffusion models. Rooted in the foundational works of Sohl-Dickstein et al
- [4] and later extended by Song et al. [13] and Ho et al. [14], these models have exhibited unparalleled prowess in generating high-quality image samples. Song et al. [15] further advanced this domain by combining generative learning mechanisms with stochastic differential equations, thereby broadening the horizon of diffusion models. Language-image pr...
- [5] PROPOSED METHOD We propose a novel defense strategy against adversarial attacks on classification models by leveraging language guidance in diffusion models for adversarial purification. For a clean sample x with label y, and a target neural network f_θ, the adversary aims to produce x_adv by introducing adversarial perturbations. This results in a pred...
- [6] EXPERIMENTS AND RESULTS 4.1. Experimental settings Datasets and network architectures: Our experimental evaluation involves three datasets, namely CIFAR-10 [11], CIFAR-100 [11] and ImageNet [10]. We utilize the base models from RobustBench [23] model zoo for CIFAR-10 and ImageNet. For CIFAR-100 we train the model following Yoon et al. [6]. We compare o...
- [7] CONCLUSION Our method addressed key limitations in adversarial defense by introducing a language-guided purification approach. Unlike traditional methods, which require extensive computational resources and specific attack knowledge, our method leverages pre-trained diffusion models and caption generators. This reduces computational overhead and enh...
- [8] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015
- [9] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018
- [10] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman, “Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,” in ICLR, 2018
- [11] Pouya Samangouei, Maya Kabkab, and Rama Chellappa, “Defense-gan: Protecting classifiers against adversarial attacks using generative models,” in ICLR, 2018
- [12] Changhao Shi, Chester Holtz, and Gal Mishne, “Online adversarial purification based on self-supervised learning,” in ICLR, 2020
- [13] Jongmin Yoon, Sung Ju Hwang, and Juho Lee, “Adversarial purification with score-based generative models,” in ICML, 2021
- [14] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Animashree Anandkumar, “Diffusion models for adversarial purification,” in ICML, 2022
- [15] Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, and J Zico Kolter, “(certified!!) adversarial robustness for free!,” in ICLR, 2022
- [16] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022
- [17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009
- [18] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009
- [19] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML, 2015, pp. 2256–2265
- [20] Yang Song and Stefano Ermon, “Generative modeling by estimating gradients of the data distribution,” NeurIPS, vol. 32, 2019
- [21] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020
- [22] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differential equations,” in ICLR, 2020
- [23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
- [24] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022
- [25] Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, and Baishakhi Ray, “Metric learning for adversarial robustness,” NeurIPS, vol. 32, 2019
- [26] Kejiang Chen, Yuefeng Chen, Hang Zhou, Xiaofeng Mao, Yuhong Li, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu, “Self-supervised adversarial training,” in ICASSP, 2020
- [27] Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein, “Adversarial training for free!,” NeurIPS, vol. 32, 2019
- [28] Eric Wong, Leslie Rice, and J Zico Kolter, “Fast is better than free: Revisiting adversarial training,” in ICLR, 2019
- [29] Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky, “Your classifier is secretly an energy based model and you should treat it like one,” in ICLR, 2019
- [30] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein, “Robustbench: a standardized adversarial robustness benchmark,” arXiv preprint arXiv:2010.09670, 2020
- [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016
- [32] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016
- [33] Mitch Hill, Jonathan Mitchell, and Song-Chun Zhu, “Stochastic security: Adversarial defense using long-run dynamics of energy-based models,” in ICLR, 2021
- [34] Yilun Du and Igor Mordatch, “Implicit generation and modeling with energy based models,” in NeurIPS, 2019
- [35] Junhao Dong, Seyed-Mohsen Moosavi-Dezfooli, Jianhuang Lai, and Xiaohua Xie, “The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,” in CVPR, 2023
- [36] Anish Athalye, Nicholas Carlini, and David Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in ICML, 2018
- [37] Xiang Li and Shihao Ji, “Defense-VAE: A fast and accurate defense against adversarial attacks,” in Machine Learning and Knowledge Discovery in Databases, Peggy Cellier and Kurt Driessens, Eds. pp. 191–207, Springer International Publishing
- [38] Yuzhe Yang, Guo Zhang, Dina Katabi, and Zhi Xu, “Me-net: Towards effective adversarial robustness with matrix estimation,” in ICML, 2019
- [39] Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang, “Unlabeled data improves adversarial robustness,” NeurIPS, 2019
- [40] Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry, “Do adversarially robust imagenet models transfer better?,” NeurIPS, 2020