
arxiv: 2309.10348 · v1 · submitted 2023-09-19 · 💻 cs.LG · cs.CR · cs.CV

Language Guided Adversarial Purification

Pith reviewed 2026-05-24 06:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · cs.CV
keywords adversarial purification · diffusion models · language guidance · adversarial defense · image robustness · caption generation · generative models

The pith

Using captions from input images to guide diffusion models purifies adversarial examples more effectively than most existing defenses without any specialized training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LGAP, a framework that first produces a caption for a given image and then uses that caption to steer a pre-trained diffusion model during the removal of adversarial perturbations. This setup seeks to make purification methods more effective by adding semantic constraints from language while keeping the defense independent of any particular classifier or attack type. A sympathetic reader would care because current alternatives either demand heavy computation or require retraining networks on attack examples, whereas this method relies only on off-the-shelf models. If the claim holds, image classification systems could gain robustness simply by chaining existing caption generators and diffusion networks.

Core claim

LGAP generates a caption for the input image with a pre-trained caption generator and feeds that caption into a diffusion network to guide the purification of adversarial perturbations, yielding stronger robustness on standard benchmarks than most prior defense techniques while requiring no specialized network training.

What carries the argument

Caption-guided diffusion purification, in which semantic information extracted from the generated caption constrains the denoising trajectory of the diffusion model applied to the adversarial input.
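
The caption-then-purify wiring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `lgap_purify`, `toy_captioner`, `toy_denoise_step`, and the `PROTOTYPES` table are hypothetical names, and the noise-then-denoise loop follows the generic diffusion-purification pattern rather than any schedule given in the paper. In practice the two callables would wrap a pre-trained caption generator and a text-conditioned diffusion model.

```python
import random
from typing import Callable, List, Tuple

Image = List[float]  # toy "image": a flat list of pixel intensities


def lgap_purify(
    image: Image,
    captioner: Callable[[Image], str],
    denoise_step: Callable[[Image, int, str], Image],
    num_steps: int = 10,
    noise_scale: float = 0.1,
    seed: int = 0,
) -> Tuple[Image, str]:
    """Caption the input, perturb it with noise, then denoise under the caption."""
    rng = random.Random(seed)
    # Step 1: the caption is computed from the (possibly adversarial) input.
    caption = captioner(image)
    # Step 2: forward noising (assumed; a single Gaussian perturbation here).
    x = [p + rng.gauss(0.0, noise_scale) for p in image]
    # Step 3: reverse steps, each conditioned on the same caption.
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t, caption)
    return x, caption


# Hypothetical stand-ins so the sketch runs without any model weights.
PROTOTYPES = {"bright": [1.0] * 4, "dark": [0.0] * 4}


def toy_captioner(image: Image) -> str:
    """Label the image by its mean intensity."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


def toy_denoise_step(x: Image, t: int, caption: str) -> Image:
    """Move halfway toward the prototype selected by the caption."""
    target = PROTOTYPES[caption]
    return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]


adversarial = [0.9, 1.1, 0.8, 1.2]
purified, caption = lgap_purify(adversarial, toy_captioner, toy_denoise_step)
```

The point of the sketch is the data flow: the caption is fixed once from the input and then constrains every reverse step, so a wrong caption would pull the result toward the wrong prototype.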

If this is right

  • Adversarial defense can be performed with only pre-trained generative and captioning models, avoiding any attack-specific training.
  • The same defense applies across different classifiers and attack algorithms without modification.
  • Robustness gains come from semantic guidance rather than from learning attack distributions directly.
  • Large-scale pre-trained models become directly usable for defense tasks without further adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same caption-guidance idea might be tested on non-diffusion purification methods to check whether language constraints help other generative defenses.
  • If caption accuracy degrades on certain image domains, the approach could be extended by using multiple caption models and selecting the most consistent one.
  • The method suggests a route for applying language models to protect other data types, such as audio or video, once suitable generative purification backbones exist.
  • Performance under very strong attacks that also target captioning models would be a natural next measurement.
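
The multi-captioner extension mentioned in the list above could be prototyped along these lines. This is a hedged sketch, assuming "most consistent" means the caption with the highest mean token overlap with the others; `jaccard` and `most_consistent_caption` are hypothetical helper names, and token-set Jaccard similarity is a simple stand-in for an embedding-based similarity.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two captions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def most_consistent_caption(captions: List[str]) -> str:
    """Return the caption that agrees most, on average, with all the others."""
    if not captions:
        raise ValueError("need at least one caption")
    if len(captions) == 1:
        return captions[0]

    def mean_agreement(i: int) -> float:
        scores = [jaccard(captions[i], c) for j, c in enumerate(captions) if j != i]
        return sum(scores) / len(scores)

    best = max(range(len(captions)), key=mean_agreement)
    return captions[best]


candidates = ["a dog on grass", "a dog on the grass", "a plate of food"]
chosen = most_consistent_caption(candidates)
```

An outlier caption produced by a fooled captioner scores low agreement with the rest and is not selected, provided the attack does not fool most captioners at once.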

Load-bearing premise

The caption produced from the adversarial image stays semantically close enough to the original content to usefully steer the diffusion purification steps.

What would settle it

A test set of adversarial images that cause caption generators to output descriptions unrelated to the true class, followed by a sharp drop in purification success rate, would show the method fails when guidance is misleading.
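
The measurement proposed above amounts to splitting purification outcomes by whether the caption still names the true class. A minimal sketch, assuming per-image records are available; the `Record` type, its field names, and the word-match test for "caption names the class" are illustrative assumptions, not part of the paper.

```python
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    true_label: str            # ground-truth class name
    caption: str               # caption generated from the adversarial image
    purified_prediction: str   # classifier output after purification


def success_by_caption_match(records: Iterable[Record]) -> Dict[str, float]:
    """Purification success rate, split by whether the caption names the true class."""
    counts = {"caption_matches": [0, 0], "caption_misleading": [0, 0]}
    for r in records:
        # Crude proxy: the class name appears as a word in the caption.
        matches = r.true_label.lower() in r.caption.lower().split()
        key = "caption_matches" if matches else "caption_misleading"
        counts[key][0] += int(r.purified_prediction == r.true_label)
        counts[key][1] += 1
    return {k: (ok / n if n else float("nan")) for k, (ok, n) in counts.items()}


records = [
    Record("dog", "a dog on grass", "dog"),
    Record("dog", "a dog running", "dog"),
    Record("cat", "a plate of food", "dog"),
    Record("cat", "a red car", "cat"),
]
rates = success_by_caption_match(records)
```

A large gap between the two rates on real data would support the failure mode described above; similar rates would suggest the diffusion prior does most of the work regardless of the caption.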

Original abstract

Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Language Guided Adversarial Purification (LGAP), a training-free adversarial defense that first generates a caption from an input image (adversarial or clean) using a pre-trained captioner and then conditions a pre-trained diffusion model on that caption to purify the image. The central claim is that this language-guided purification outperforms most existing defenses while remaining classifier- and attack-agnostic.

Significance. If the empirical results hold after the caption-fidelity assumption is validated, the work would be significant for demonstrating that large-scale pre-trained multimodal models can be composed into a general-purpose, training-free defense without the computational cost of adversarial training or the need for attack-specific knowledge.

major comments (1)
  1. [Abstract, §3] Abstract and §3 (method): The headline claim that LGAP 'outperforms most existing adversarial defense techniques' is load-bearing on the assumption that captions generated from adversarial images remain semantically accurate enough to usefully constrain the diffusion process. No quantitative evaluation of caption accuracy, semantic similarity (e.g., CLIP score, BLEU), or failure rate under the evaluated attacks is reported; if caption quality collapses, the method reduces to standard diffusion purification and the claimed advantage disappears.
minor comments (2)
  1. [Abstract] The abstract states the method is 'evaluated against strong adversarial attacks' but does not name the specific attacks, threat models (ℓ_p norms, query access), or datasets used; these details should be stated explicitly in the abstract or §4.
  2. [§3] Notation for the diffusion guidance step (how the caption embedding is injected into the score network) is not introduced until the method section; a short equation or diagram in §3 would improve readability.
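
The caption-fidelity check requested in the major comment could start from something as simple as the sketch below. It is an illustration under stated assumptions: `bleu1` implements only sentence-level unigram BLEU (clipped precision times brevity penalty) against a single reference, which is a simplification of full BLEU; `caption_failure_rate` and its 0.5 threshold are hypothetical choices. A CLIP-based score would need a pre-trained model and is not shown.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def bleu1(reference: str, candidate: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c).
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * precision


def caption_failure_rate(pairs: Iterable[Tuple[str, str]], threshold: float = 0.5) -> float:
    """Fraction of (clean_caption, adversarial_caption) pairs scoring below the threshold."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one caption pair")
    failures = sum(1 for clean, adv in pairs if bleu1(clean, adv) < threshold)
    return failures / len(pairs)


pairs = [
    ("a dog on grass", "a dog on grass"),   # caption survives the attack
    ("a dog on grass", "a plate of food"),  # caption drifts to unrelated content
]
rate = caption_failure_rate(pairs)
```

Reporting this rate per attack setting, next to end-to-end robust accuracy, would indicate how often the guidance signal itself is corrupted.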

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting an important aspect of our method's assumptions. We address the major comment point-by-point below.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): The headline claim that LGAP 'outperforms most existing adversarial defense techniques' is load-bearing on the assumption that captions generated from adversarial images remain semantically accurate enough to usefully constrain the diffusion process. No quantitative evaluation of caption accuracy, semantic similarity (e.g., CLIP score, BLEU), or failure rate under the evaluated attacks is reported; if caption quality collapses, the method reduces to standard diffusion purification and the claimed advantage disappears.

    Authors: We agree that this is a substantive point. The current manuscript reports only end-to-end defense accuracy and does not provide direct quantitative measurements (CLIP score, BLEU, or failure rates) of caption fidelity on adversarial versus clean images. While the performance gap between LGAP and standard diffusion purification in our experiments is consistent with the language guidance contributing value, we acknowledge that this does not rigorously isolate the caption-quality assumption. In the revised version we will add a dedicated subsection with these metrics across the attack settings used in the paper, together with a short analysis of cases where caption quality degrades. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical combination of pre-trained components with no equations or fitted quantities reducing claims to inputs

full rationale

The paper presents LGAP as a framework that applies existing pre-trained caption generators and diffusion models to adversarial images, generating a caption to condition purification. No derivation chain, equations, or parameter-fitting steps are described that would make any performance claim equivalent to its inputs by construction. The outperformance statement is framed as an empirical result from evaluation against attacks, not a mathematical prediction forced by self-definition or self-citation. The caption-fidelity assumption is a potential correctness risk but does not constitute circularity under the specified patterns, as it is not a fitted quantity renamed as a prediction nor a load-bearing self-citation. The method is self-contained against external benchmarks via standard adversarial evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only view yields no explicit free parameters or invented entities; the method rests on the domain assumption that pre-trained diffusion and caption models transfer to the purification task.

axioms (1)
  • domain assumption: Pre-trained diffusion models can be effectively conditioned by text captions for image reconstruction
    Invoked when the caption is used to guide purification.



Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 2 internal anchors

  1. [1]

    Image Captioning

    INTRODUCTION The use of deep neural networks, especially within the realm of computer vision, has ushered in transformative advancements in various applications. Despite these strides, a consistent vulnerability is the susceptibility of such models to adversarial perturbations [1]. These perturbations, often imperceptible, can fool even the most s...

  2. [2]

    Language Guided Adversarial Purification

    and Carlini et al. [8], have harnessed the potential of score-based and diffusion models towards purification of adversarial samples. Primarily, the adversarial purification techniques have focussed only on the image modality, despite promising performance of diffusion models in multi-modal tasks such as text-to-image generation [9]. Thus, in our w...

  3. [3]

    Rooted in the foundational works of Sohl-Dickstein et al

    RELATED WORKS Diffusion models in image generation: The landscape of image generation has been revolutionized by diffusion models. Rooted in the foundational works of Sohl-Dickstein et al

  4. [4]

    [13] and Ho et al

    and later extended by Song et al. [13] and Ho et al. [14], these models have exhibited unparalleled prowess in generating high-quality image samples. Song et al. [15] further advanced this domain by combining generative learning mechanisms with stochastic differential equations, thereby broadening the horizon of diffusion models. Language-image pr...

  5. [5]

    For a clean sample x with label y, and a target neural network f_θ, the adversary aims to produce x_adv by introducing adversarial perturbations

    PROPOSED METHOD We propose a novel defense strategy against adversarial attacks on classification models by leveraging language guidance in diffusion models for adversarial purification. For a clean sample x with label y, and a target neural network f_θ, the adversary aims to produce x_adv by introducing adversarial perturbations. This results in a pred...

  6. [6]

    Experimental settings Datasets and network architectures: Our experimental evaluation involves three datasets, namely CIFAR-10 [11], CIFAR-100 [11] and ImageNet [10]

    EXPERIMENTS AND RESULTS 4.1. Experimental settings Datasets and network architectures: Our experimental evaluation involves three datasets, namely CIFAR-10 [11], CIFAR-100 [11] and ImageNet [10]. We utilize the base models from RobustBench [23] model zoo for CIFAR-10 and ImageNet. For CIFAR-100 we train the model following Yoon et al. [6]. We compare o...

  7. [7]

    CONCLUSION Our method addressed key limitations in adversarial defense by introducing a language-guided purification approach. Unlike traditional methods, which require extensive computational resources and specific attack knowledge, our method leverages pre-trained diffusion models and caption generators. This reduces computational overhead and enh...

  8. [8]

    Explaining and harnessing adversarial examples,

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015

  9. [9]

    Towards deep learning models resistant to adversarial attacks,

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018

  10. [10]

    Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,

    Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman, “Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,” in ICLR, 2018

  11. [11]

    Defense-gan: Protecting classifiers against adversarial attacks using generative models,

    Pouya Samangouei, Maya Kabkab, and Rama Chellappa, “Defense-gan: Protecting classifiers against adversarial attacks using generative models,” in ICLR, 2018

  12. [12]

    Online adversarial purification based on self-supervised learning,

    Changhao Shi, Chester Holtz, and Gal Mishne, “Online adversarial purification based on self-supervised learning,” in ICLR, 2020

  13. [13]

    Adversarial purification with score-based generative models,

    Jongmin Yoon, Sung Ju Hwang, and Juho Lee, “Adversarial purification with score-based generative models,” in ICML, 2021

  14. [14]

    Diffusion models for adversarial purification,

    Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Animashree Anandkumar, “Diffusion models for adversarial purification,” in ICML, 2022

  15. [15]

    (Certified!!) adversarial robustness for free!,

    Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, and J Zico Kolter, “(Certified!!) adversarial robustness for free!,” in ICLR, 2022

  16. [16]

    High-resolution image synthesis with latent diffusion models,

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022

  17. [17]

    Imagenet: A large-scale hierarchical image database,

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009

  18. [18]

    Learning multiple layers of features from tiny images,

    Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009

  19. [19]

    Deep unsupervised learning using nonequilibrium thermodynamics,

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML, 2015, pp. 2256–2265

  20. [20]

    Generative modeling by estimating gradients of the data distribution,

    Yang Song and Stefano Ermon, “Generative modeling by estimating gradients of the data distribution,” NeurIPS, vol. 32, 2019

  21. [21]

    Denoising diffusion probabilistic models,

    Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

  22. [22]

    Score-based generative modeling through stochastic differential equations,

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differential equations,” in ICLR, 2020

  23. [23]

    Learning transferable visual models from natural language supervision,

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021

  24. [24]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022

  25. [25]

    Metric learning for adversarial robustness,

    Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, and Baishakhi Ray, “Metric learning for adversarial robustness,” NeurIPS, vol. 32, 2019

  26. [26]

    Self-supervised adversarial training,

    Kejiang Chen, Yuefeng Chen, Hang Zhou, Xiaofeng Mao, Yuhong Li, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu, “Self-supervised adversarial training,” in ICASSP, 2020

  27. [27]

    Adversarial training for free!,

    Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein, “Adversarial training for free!,” NeurIPS, vol. 32, 2019

  28. [28]

    Fast is better than free: Revisiting adversarial training,

    Eric Wong, Leslie Rice, and J Zico Kolter, “Fast is better than free: Revisiting adversarial training,” in ICLR, 2019

  29. [29]

    Your classifier is secretly an energy based model and you should treat it like one,

    Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky, “Your classifier is secretly an energy based model and you should treat it like one,” in ICLR, 2019

  30. [30]

    Robustbench: a standardized adversarial robustness benchmark,

    Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein, “Robustbench: a standardized adversarial robustness benchmark,” arXiv preprint arXiv:2010.09670, 2020

  31. [31]

    Deep residual learning for image recognition,

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016

  32. [32]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016

  33. [33]

    Stochastic security: Adversarial defense using long-run dynamics of energy-based models,

    Mitch Hill, Jonathan Mitchell, and Song-Chun Zhu, “Stochastic security: Adversarial defense using long-run dynamics of energy-based models,” in ICLR, 2021

  34. [34]

    Implicit generation and modeling with energy based models,

    Yilun Du and Igor Mordatch, “Implicit generation and modeling with energy based models,” in NeurIPS, 2019

  35. [35]

    The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,

    Junhao Dong, Seyed-Mohsen Moosavi-Dezfooli, Jianhuang Lai, and Xiaohua Xie, “The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,” in CVPR, 2023

  36. [36]

    Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,

    Anish Athalye, Nicholas Carlini, and David Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in ICML, 2018

  37. [37]

    Defense-VAE: A fast and accurate defense against adversarial attacks,

    Xiang Li and Shihao Ji, “Defense-VAE: A fast and accurate defense against adversarial attacks,” in Machine Learning and Knowledge Discovery in Databases, Peggy Cellier and Kurt Driessens, Eds. pp. 191–207, Springer International Publishing

  38. [38]

    Me-net: Towards effective adversarial robustness with matrix estimation,

    Yuzhe Yang, Guo Zhang, Dina Katabi, and Zhi Xu, “Me-net: Towards effective adversarial robustness with matrix estimation,” in ICML, 2019

  39. [39]

    Unlabeled data improves adversarial robustness,

    Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang, “Unlabeled data improves adversarial robustness,” NeurIPS, 2019

  40. [40]

    Do adversarially robust imagenet models transfer better?,

    Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry, “Do adversarially robust imagenet models transfer better?,” NeurIPS, 2020