
arxiv: 1907.05718 · v1 · pith:ZTJCSMSLnew · submitted 2019-07-11 · 💻 cs.LG · cs.AI · stat.ML

Why Blocking Targeted Adversarial Perturbations Impairs the Ability to Learn

Pith reviewed 2026-05-24 23:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords adversarial perturbations · defensive distillation · targeted attacks · neural network classifiers · input gradient · white-box attacks · machine learning security

The pith

Targeted adversarial attacks share the training input gradient, so defenses against them impair learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why defenses have struggled to block white-box adversarial perturbations on neural network classifiers. It focuses on defensive distillation and shows through systematic tests that this approach stops non-targeted attacks but fails against targeted ones such as the Targeted Gradient Sign Method. The central observation is that targeted attacks rely on the loss gradient with respect to the input, the same signal used to train the network. Blocking access to that gradient to stop the attacks would therefore also stop training. This tradeoff explains the persistent difficulty in creating robust defenses.
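The two attack families named above differ only in which label's loss they differentiate and in the sign of the step. The following is a minimal sketch, assuming a toy linear softmax classifier in place of the paper's networks; the names `softmax`, `input_gradient`, `fgsm`, `tgsm`, and the random weights `W`, `b` are illustrative and do not come from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. x for logits = W @ x + b."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    return W.T @ (p - onehot)

def fgsm(W, b, x, true_label, eps):
    # Non-targeted: step up the loss of the true label.
    return x + eps * np.sign(input_gradient(W, b, x, true_label))

def tgsm(W, b, x, target_label, eps):
    # Targeted: step down the loss of the attacker's chosen label.
    return x - eps * np.sign(input_gradient(W, b, x, target_label))

# Hypothetical 3-class, 8-feature model (an assumption for illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
b = np.zeros(3)
x = rng.normal(size=8)

true_label = int(np.argmax(W @ x + b))
target_label = (true_label + 1) % 3

x_fgsm = fgsm(W, b, x, true_label, eps=0.1)
x_tgsm = tgsm(W, b, x, target_label, eps=0.1)

p_clean = softmax(W @ x + b)
p_fgsm = softmax(W @ x_fgsm + b)
p_tgsm = softmax(W @ x_tgsm + b)
```

Both perturbations stay inside an L-infinity ball of radius `eps`. Comparing `p_clean`, `p_fgsm`, and `p_tgsm` shows how each step moves the class probabilities in this toy setting.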

Core claim

Defensive distillation is highly effective against non-targeted attacks but is unsuitable for targeted attacks. This discovery leads us to realize that targeted attacks leverage the same input gradient that allows a network to be trained. This implies that blocking them will require losing the network's ability to learn, presenting an impossible tradeoff to the research community.
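One way to read the asymmetry in this claim is through softmax saturation. The sketch below assumes, hypothetically, that a distilled network emits very large logits at inference, so its softmax output rounds to an exact one-hot vector in float32. The logit values and the helper `logit_error` are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logit_error(logits, label):
    """softmax(logits) - one_hot(label): the factor that every
    cross-entropy gradient (w.r.t. input or weights) passes through."""
    p = softmax(logits)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    return p - onehot

# Hypothetical two-class logits: moderate confidence vs. saturated.
normal_logits = np.array([2.0, -1.0], dtype=np.float32)
saturated_logits = np.array([200.0, -100.0], dtype=np.float32)

true_label, target_label = 0, 1

err_true_normal = logit_error(normal_logits, true_label)
err_true_saturated = logit_error(saturated_logits, true_label)
err_target_saturated = logit_error(saturated_logits, target_label)
```

With the saturated logits, the error term for the true label is exactly zero, so a non-targeted gradient-sign step has no direction to follow. The error term for any other label stays non-zero, which leaves a targeted step usable.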

What carries the argument

The gradient of the loss with respect to the input, which targeted attacks use to craft perturbations and which, on the paper's account, is the same signal that training relies on to update weights.

Load-bearing premise

That the gradient used to generate targeted attacks is the identical signal required for the network to learn from data.

What would settle it

A defense that blocks targeted attacks while still allowing the network to compute and use its input gradient for weight updates during training would falsify the claim.

Figures

Figures reproduced from arXiv: 1907.05718 by Yuval Elovici, Ziv Katzir.

Figure 1. Figure 1 illustrates the result of this experiment, comparing the logit values for normal and FGSM perturbed inputs. The two logit components are indicated by the two axes of the plot. The horizontal axis represents the logit component associated with class 0, while the vertical axis represents the logit component of class 1. Each plotted point represents the logit values of a single input image from the test set,…
Figure 2. The effect of FGSM on a defensively distilled classifier network. (a) Logit values given normal input, (b) logit values for FGSM perturbed input, (c) perturbation shift, comparing the original and perturbed input. We continued to explore the defensively distilled classifier network using TGSM, targeted BIM, and the C&W attacks. In all three cases defensive distillation was unable to block the perturbation.…
Figure 3. The effect of TGSM on a defensively distilled classifier network. (a) Logit values given normal input, (b) logit values for TGSM perturbed input, (c) perturbation shift, comparing the original and perturbed input. 4. Formal Analysis of Input Loss Gradients. As indicated in Section 3, empirical exploration of the logit values of the defensively distilled two-class classifier led us to believe that defensive …
Figure 4. Comparing (a) normal and (b) defensively distilled softmax values. This analysis of the softmax outputs suggests it should be possible to optimize the defensive distillation process by using just a single network training phase. Training a single model with high distillation temperature in order to increase its certainty of the correct class label, and then setting the temperature to 1 during inference. In…
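The Figure 4 excerpt describes training at a high softmax temperature and then setting the temperature to 1 at inference. A minimal sketch of that effect follows. The temperature `T`, the logit values, and the helper `softmax_t` are assumptions chosen for illustration, not values reported in the paper.

```python
import numpy as np

def softmax_t(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

T = 20.0                          # hypothetical distillation temperature
logits = np.array([60.0, 0.0])    # hypothetical logits of a model trained at T

# During training the logits are softened by T, so the output is not saturated.
p_train = softmax_t(logits, temperature=T)

# At inference the same logits are used with temperature 1, which pushes
# the output very close to a one-hot vector.
p_infer = softmax_t(logits, temperature=1.0)
```

The same logits look moderately confident at temperature `T` and almost fully saturated at temperature 1, which is the increase in certainty the excerpt refers to.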
Original abstract

Despite their accuracy, neural network-based classifiers are still prone to manipulation through adversarial perturbations. Those perturbations are designed to be misclassified by the neural network, while being perceptually identical to some valid input. The vast majority of attack methods rely on white-box conditions, where the attacker has full knowledge of the attacked network's parameters. This allows the attacker to calculate the network's loss gradient with respect to some valid input and use this gradient in order to create an adversarial example. The task of blocking white-box attacks has proven difficult to solve. While a large number of defense methods have been suggested, they have had limited success. In this work we examine this difficulty and try to understand it. We systematically explore the abilities and limitations of defensive distillation, one of the most promising defense mechanisms against adversarial perturbations suggested so far in order to understand the defense challenge. We show that contrary to commonly held belief, the ability to bypass defensive distillation is not dependent on an attack's level of sophistication. In fact, simple approaches, such as the Targeted Gradient Sign Method, are capable of effectively bypassing defensive distillation. We prove that defensive distillation is highly effective against non-targeted attacks but is unsuitable for targeted attacks. This discovery leads us to realize that targeted attacks leverage the same input gradient that allows a network to be trained. This implies that blocking them will require losing the network's ability to learn, presenting an impossible tradeoff to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that defensive distillation is highly effective against non-targeted adversarial attacks but unsuitable for targeted attacks such as the Targeted Gradient Sign Method (TGSM). Experiments show that simple targeted attacks bypass distillation, leading to the conclusion that targeted attacks leverage the same input gradient used in network training; therefore, blocking targeted attacks necessarily impairs the network's ability to learn, creating an impossible tradeoff for defenses.

Significance. If the central interpretive claim holds, the result would be significant for the adversarial robustness literature by identifying a potential fundamental limitation: any defense that neutralizes the shared gradient for targeted attacks would also prevent standard training. This could shift focus toward non-gradient-based or architecture-level defenses and explain why many white-box defenses have limited success.

major comments (2)
  1. [Abstract] The claim that 'targeted attacks leverage the same input gradient that allows a network to be trained' and that 'blocking them will require losing the network's ability to learn' is presented as following directly from the distillation bypass results, but no derivation, mathematical argument, or ablation is provided showing that the gradient is necessarily shared or that it is the only possible defense target. This step is load-bearing for the 'impossible tradeoff' conclusion.
  2. [Abstract / conclusion paragraph] The manuscript reports differential effectiveness (distillation succeeds on non-targeted but fails on targeted TGSM) yet does not rule out alternative defense mechanisms that might neutralize targeted gradient exploitation without nullifying the training gradient; the generalization from distillation-specific results to all possible defenses therefore lacks supporting analysis.
minor comments (1)
  1. The abstract would be clearer if it briefly stated the datasets, model architectures, and quantitative success rates of the TGSM bypass experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the scope and strength of our claims. We address the major comments point-by-point below. Our responses focus on the empirical nature of the work and the specific role of defensive distillation in revealing the gradient tradeoff.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'targeted attacks leverage the same input gradient that allows a network to be trained' and that 'blocking them will require losing the network's ability to learn' is presented as following directly from the distillation bypass results, but no derivation, mathematical argument, or ablation is provided showing that the gradient is necessarily shared or that it is the only possible defense target. This step is load-bearing for the 'impossible tradeoff' conclusion.

    Authors: The manuscript's conclusion is indeed an interpretation based on the experimental findings rather than a formal derivation. Defensive distillation is designed to reduce the effectiveness of gradient-based attacks by smoothing the output probabilities. Its success against non-targeted attacks but failure against targeted TGSM indicates that targeted attacks rely on the precise gradient information that training also uses. We do not claim this is mathematically proven as the only possible mechanism; however, the results suggest that defenses targeting the gradient will face this tradeoff. We will revise the abstract to emphasize that this is an empirical observation leading to the interpretive claim, and add a short discussion on the lack of formal proof. revision: partial

  2. Referee: [Abstract / conclusion paragraph] The manuscript reports differential effectiveness (distillation succeeds on non-targeted but fails on targeted TGSM) yet does not rule out alternative defense mechanisms that might neutralize targeted gradient exploitation without nullifying the training gradient; the generalization from distillation-specific results to all possible defenses therefore lacks supporting analysis.

    Authors: We agree that the paper does not exhaustively rule out all conceivable alternative defenses. Our focus is on defensive distillation as a representative gradient-masking defense to demonstrate the differential effect. The generalization is that since distillation aims to block gradient exploitation but cannot do so for targeted attacks without (implicitly) affecting the usable gradient for training, it points to a broader challenge. We will add text in the conclusion to explicitly state that while other mechanisms might be possible, the distillation results highlight why gradient-based defenses struggle with targeted attacks, without claiming universality. revision: partial

Circularity Check

0 steps flagged

No circularity; central claim is interpretive inference from experiments, not reduction by construction.

Full rationale

The paper's derivation proceeds from systematic experiments on defensive distillation (showing effectiveness against non-targeted attacks but failure on targeted ones such as TGSM) to the interpretive statement that targeted attacks leverage the training input gradient. This inference does not reduce to any self-definitional equation, fitted parameter renamed as prediction, or self-citation chain. No equations or prior-author uniqueness theorems are invoked to force the conclusion; the claim remains an observation external to the reported results and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the standard machine-learning assumption that backpropagation uses the input gradient for training; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: The gradient of the loss with respect to the input is the mechanism both for generating targeted adversarial examples and for updating model parameters during training.
    Invoked in the abstract's final paragraph to reach the 'impossible tradeoff' conclusion.
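To make the assumption above concrete, the following is a standard result for a single softmax layer trained with cross-entropy, given here as an illustration and not as the paper's own derivation. The symbols W, b, x, p, and the one-hot label y are notation assumed for this sketch.

```latex
\begin{aligned}
z &= W x + b, \qquad p = \operatorname{softmax}(z), \qquad L = -\sum_k y_k \log p_k, \\
\frac{\partial L}{\partial z} &= p - y, \\
\frac{\partial L}{\partial W} &= (p - y)\, x^{\top}, \qquad
\frac{\partial L}{\partial x} = W^{\top} (p - y).
\end{aligned}
```

The weight gradient and the input gradient are distinct quantities, but in this layer both are built from the same error term p - y. Under this reading, a defense that forces p - y to zero for every label would also zero the weight update, which is one way to interpret the tradeoff the ledger records.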




Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1]

    Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199

  2. [2]

    Papernot, N., et al. (2016, March). The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on (pp. 372-387). IEEE

  3. [3]

    Goodfellow, I., et al. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572

  4. [4]

    Kurakin, A., et al. (2016). Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236

  5. [5]

    Carlini, N., & Wagner, D. (2016b). Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644

  6. [6]

    Papernot, N., et al. (2015). Distillation as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1511.04508

  7. [7]

    Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533

  8. [8]

    Ross, A. S., & Doshi-Velez, F. (2018, April). Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-Second AAAI Conference on Artificial Intelligence

  9. [9]

    LeCun, Y., Cortes, C., & Burges, C. J. (2010). MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2

  10. [10]

    Dunne, R. A., & Campbell, N. A. (1997). On the pairing of the Softmax activation and cross-entropy penalty functions and the derivation of the Softmax activation function. In Proc. 8th Aust. Conf. on the Neural Networks, Melbourne (Vol. 181, p. 185)

  11. [11]

    Akhtar, N., & Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553

  12. [12]

    Moosavi-Dezfooli, S. M., Fawzi, A., Fawzi, O., & Frossard, P. (2017). Universal adversarial perturbations. arXiv preprint

  13. [13]

    Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Vol. 1, No. 4, p. 7). Technical report, University of Toronto

  14. [14]

    Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017, April). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506-519). ACM

  15. [15]

    Keras MNIST CNN Tutorial. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Accessed June 2019

  16. [16]

    Carlini, N., & Wagner, D. (2017, November). Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (pp. 3-14). ACM

  17. [17]

    Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420

  18. [18]

    Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  19. [19]

    Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017, April). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506-519). ACM

  20. [20]

    Classification and Loss Evaluation. https://deepnotes.io/softmax-crossentropy Accessed June 2019

    Appendix A – Derivation of the Cross Entropy Loss and Softmax Activation Function. This section describes the gradient derivation of the cross-entropy loss combined with the softmax activation function. This derivation was first formally described in [10]....