Why Blocking Targeted Adversarial Perturbations Impairs the Ability to Learn

Yuval Elovici; Ziv Katzir

arXiv: 1907.05718 · v1 · pith:ZTJCSMSLnew · submitted 2019-07-11 · 💻 cs.LG · cs.AI · stat.ML


Pith reviewed 2026-05-24 23:08 UTC · model grok-4.3


keywords: adversarial perturbations, defensive distillation, targeted attacks, neural network classifiers, input gradient, white-box attacks, machine learning security


The pith

Targeted adversarial attacks share the training input gradient, so defenses against them impair learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why defenses have struggled to block white-box adversarial perturbations on neural network classifiers. It focuses on defensive distillation and shows through systematic tests that this approach stops non-targeted attacks but fails against targeted ones such as the Targeted Gradient Sign Method. The central observation is that targeted attacks rely on the loss gradient with respect to the input, the same signal used to train the network. Blocking access to that gradient to stop the attacks would therefore also stop training. This tradeoff explains the persistent difficulty in creating robust defenses.

Core claim

Defensive distillation is highly effective against non-targeted attacks but is unsuitable for targeted attacks. This discovery leads us to realize that targeted attacks leverage the same input gradient that allows a network to be trained. This implies that blocking them will require losing the network's ability to learn, presenting an impossible tradeoff to the research community.

What carries the argument

The gradient of the loss with respect to the input, which targeted attacks use to craft perturbations and which, according to the paper, is the same signal that allows the network to be trained.
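As a minimal sketch of how the two attack families consume this gradient, the following pure-Python example applies one FGSM step and one TGSM step to a hypothetical two-class linear softmax classifier. The weights `W`, the input `x`, the step size `eps`, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # Linear model: z = W x (no bias, for brevity).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict(W, x):
    z = logits(W, x)
    return z.index(max(z))

def input_grad(W, x, label):
    # Gradient of the cross-entropy loss for `label` with respect to x.
    p = softmax(logits(W, x))
    delta = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    # Chain rule through z = W x gives dL/dx = W^T delta.
    return [sum(W[k][i] * delta[k] for k in range(len(W))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(W, x, true_label, eps):
    # Non-targeted: step so as to INCREASE the loss of the true label.
    g = input_grad(W, x, true_label)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def tgsm(W, x, target_label, eps):
    # Targeted: step so as to DECREASE the loss of the attacker's target label.
    g = input_grad(W, x, target_label)
    return [xi - eps * sign(gi) for xi, gi in zip(x, g)]

W = [[1.0, -1.0], [-1.0, 1.0]]  # hypothetical toy weights
x = [0.6, 0.4]                  # originally assigned to class 0
x_fgsm = fgsm(W, x, true_label=0, eps=0.2)
x_tgsm = tgsm(W, x, target_label=1, eps=0.2)
print(predict(W, x), predict(W, x_fgsm), predict(W, x_tgsm))
```

The only difference between the two attacks in this sketch is which label the loss is taken against and the direction of the step; both read the same input gradient.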

Load-bearing premise

That the gradient used to generate targeted attacks is the identical signal required for the network to learn from data.

What would settle it

A defense that blocks targeted attacks while still allowing the network to compute and use its input gradient for weight updates during training would falsify the claim.

Figures

Figures reproduced from arXiv: 1907.05718 by Yuval Elovici, Ziv Katzir.

**Figure 1.** Logit values for normal and FGSM-perturbed inputs. The two logit components are indicated by the two axes of the plot. The horizontal axis represents the logit component associated with class 0, while the vertical axis represents the logit component of class 1. Each plotted point represents the logit values of a single input image from the test set,…

**Figure 2.** The effect of FGSM on a defensively distilled classifier network. (a) Logit values given normal input, (b) logit values for FGSM-perturbed input, (c) perturbation shift, comparing the original and perturbed input. We continued to explore the defensively distilled classifier network using TGSM, targeted BIM, and the C&W attacks. In all three cases defensive distillation was unable to block the perturbation.…

**Figure 3.** The effect of TGSM on a defensively distilled classifier network. (a) Logit values given normal input, (b) logit values for TGSM-perturbed input, (c) perturbation shift, comparing the original and perturbed input. Section 4, "Formal Analysis of Input Loss Gradients": As indicated in Section 3, empirical exploration of the logit values of the defensively distilled two-class classifier led us to believe that defensive …

**Figure 4.** Comparing (a) normal and (b) defensively distilled softmax values. This analysis of the softmax outputs suggests it should be possible to optimize the defensive distillation process by using just a single network training phase. Training a single model with high distillation temperature in order to increase its certainty of the correct class label, and then setting the temperature to 1 during inference. In…
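The temperature effect mentioned in the Figure 4 caption can be illustrated with a small sketch. It assumes the usual temperature-scaled softmax, in which logits are divided by T before normalization. The logit values and the temperature of 20 are hypothetical and are not figures from the paper.

```python
import math

def softmax_t(logits, temperature=1.0):
    # Temperature-scaled softmax: divide logits by T, then normalize.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical logits of a network trained at a high temperature (T = 20):
# training at high T tends to push the raw logits far apart.
logits = [40.0, 0.0]

p_train = softmax_t(logits, temperature=20.0)  # moderately confident at T = 20
p_infer = softmax_t(logits, temperature=1.0)   # nearly one-hot at T = 1

print(p_train)
print(p_infer)
```

With the same logits, lowering the temperature to 1 at inference makes the output close to one-hot, which is the increase in certainty that the caption describes.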

Original abstract

Despite their accuracy, neural network-based classifiers are still prone to manipulation through adversarial perturbations. Those perturbations are designed to be misclassified by the neural network, while being perceptually identical to some valid input. The vast majority of attack methods rely on white-box conditions, where the attacker has full knowledge of the attacked network's parameters. This allows the attacker to calculate the network's loss gradient with respect to some valid input and use this gradient in order to create an adversarial example. The task of blocking white-box attacks has proven difficult to solve. While a large number of defense methods have been suggested, they have had limited success. In this work we examine this difficulty and try to understand it. We systematically explore the abilities and limitations of defensive distillation, one of the most promising defense mechanisms against adversarial perturbations suggested so far in order to understand the defense challenge. We show that contrary to commonly held belief, the ability to bypass defensive distillation is not dependent on an attack's level of sophistication. In fact, simple approaches, such as the Targeted Gradient Sign Method, are capable of effectively bypassing defensive distillation. We prove that defensive distillation is highly effective against non-targeted attacks but is unsuitable for targeted attacks. This discovery leads us to realize that targeted attacks leverage the same input gradient that allows a network to be trained. This implies that blocking them will require losing the network's ability to learn, presenting an impossible tradeoff to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows distillation blocks non-targeted attacks but not simple targeted ones like TGSM, yet the leap to an impossible gradient tradeoff for any defense is an unproven interpretation.


The core result is that defensive distillation stops non-targeted adversarial examples but fails against targeted ones, even when the attack is as basic as the Targeted Gradient Sign Method. The authors back the distinction with experiments and a proof that the defense works for one case but not the other. That separation is the useful part: it shows the bypass is not about attack complexity, which lines up with some earlier observations but makes the point more explicit here. The paper does a clean job of isolating the targeted versus non-targeted difference and tying it back to the shared gradient idea in the abstract. That framing is new enough to be worth noting.

The soft spot is the final step. The results are specific to how distillation interacts with the two attack types. From there the authors conclude that targeted attacks must use the training gradient, so blocking them means losing the ability to learn. Nothing in the reported experiments or proof shows this is the only possible defense target or that every other architecture would face the same limit. It reads as an interpretive claim rather than a necessity derived from the math. The evidence supports the distillation finding but does not yet close the loop on the broader tradeoff.

This is for researchers focused on why gradient-based defenses keep failing on targeted attacks. Someone already working on distillation or gradient alignment will find the targeted/non-targeted split worth checking. It is coherent on its own terms and engages the literature directly, so it deserves a serious referee even if the central implication needs more support or counter-examples. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The paper claims that defensive distillation is highly effective against non-targeted adversarial attacks but unsuitable for targeted attacks such as the Targeted Gradient Sign Method (TGSM). Experiments show that simple targeted attacks bypass distillation, leading to the conclusion that targeted attacks leverage the same input gradient used in network training; therefore, blocking targeted attacks necessarily impairs the network's ability to learn, creating an impossible tradeoff for defenses.

Significance. If the central interpretive claim holds, the result would be significant for the adversarial robustness literature by identifying a potential fundamental limitation: any defense that neutralizes the shared gradient for targeted attacks would also prevent standard training. This could shift focus toward non-gradient-based or architecture-level defenses and explain why many white-box defenses have limited success.

major comments (2)

[Abstract] The claim that 'targeted attacks leverage the same input gradient that allows a network to be trained' and that 'blocking them will require losing the network's ability to learn' is presented as following directly from the distillation bypass results, but no derivation, mathematical argument, or ablation is provided showing that the gradient is necessarily shared or that it is the only possible defense target. This step is load-bearing for the 'impossible tradeoff' conclusion.

[Abstract / conclusion paragraph] The manuscript reports differential effectiveness (distillation succeeds on non-targeted but fails on targeted TGSM) yet does not rule out alternative defense mechanisms that might neutralize targeted gradient exploitation without nullifying the training gradient; the generalization from distillation-specific results to all possible defenses therefore lacks supporting analysis.

minor comments (1)

The abstract would be clearer if it briefly stated the datasets, model architectures, and quantitative success rates of the TGSM bypass experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the scope and strength of our claims. We address the major comments point-by-point below. Our responses focus on the empirical nature of the work and the specific role of defensive distillation in revealing the gradient tradeoff.


Referee: [Abstract] The claim that 'targeted attacks leverage the same input gradient that allows a network to be trained' and that 'blocking them will require losing the network's ability to learn' is presented as following directly from the distillation bypass results, but no derivation, mathematical argument, or ablation is provided showing that the gradient is necessarily shared or that it is the only possible defense target. This step is load-bearing for the 'impossible tradeoff' conclusion.

Authors: The manuscript's conclusion is indeed an interpretation based on the experimental findings rather than a formal derivation. Defensive distillation is designed to reduce the effectiveness of gradient-based attacks by smoothing the output probabilities. Its success against non-targeted attacks but failure against targeted TGSM indicates that targeted attacks rely on the precise gradient information that training also uses. We do not claim this is mathematically proven as the only possible mechanism; however, the results suggest that defenses targeting the gradient will face this tradeoff. We will revise the abstract to emphasize that this is an empirical observation leading to the interpretive claim, and add a short discussion on the lack of formal proof.
Revision: partial

Referee: [Abstract / conclusion paragraph] The manuscript reports differential effectiveness (distillation succeeds on non-targeted but fails on targeted TGSM) yet does not rule out alternative defense mechanisms that might neutralize targeted gradient exploitation without nullifying the training gradient; the generalization from distillation-specific results to all possible defenses therefore lacks supporting analysis.

Authors: We agree that the paper does not exhaustively rule out all conceivable alternative defenses. Our focus is on defensive distillation as a representative gradient-masking defense to demonstrate the differential effect. The generalization is that since distillation aims to block gradient exploitation but cannot do so for targeted attacks without (implicitly) affecting the usable gradient for training, it points to a broader challenge. We will add text in the conclusion to explicitly state that while other mechanisms might be possible, the distillation results highlight why gradient-based defenses struggle with targeted attacks, without claiming universality.
Revision: partial
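One way to see why a saturating defense could treat the two attack types differently is the arithmetic of the cross-entropy gradient with respect to the logits, which equals the softmax output minus the one-hot label. The sketch below uses hypothetical, heavily saturated logits. It is an illustration of that arithmetic under stated assumptions, not a reproduction of the paper's experiments or proof.

```python
import math

def softmax(z):
    # Numerically stable softmax; very negative exponents underflow to 0.0.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_grad(z, label):
    # d(cross-entropy)/d(logits) = softmax(z) - one_hot(label)
    p = softmax(z)
    return [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]

# Hypothetical saturated logits: the model is (over)confident in class 0.
z = [800.0, 0.0]

untargeted = logit_grad(z, label=0)  # loss taken against the true label
targeted = logit_grad(z, label=1)    # loss taken against an attacker-chosen target

print(untargeted)  # every entry is zero, so a sign-based step has no direction
print(targeted)    # entries of magnitude 1, so the gradient is still informative
```

Because the input gradient is this vector propagated back through the network, a zero vector here yields a zero input gradient for the non-targeted loss, while the targeted loss still provides a usable direction.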

Circularity Check

0 steps flagged

No circularity; central claim is interpretive inference from experiments, not reduction by construction.


The paper's derivation proceeds from systematic experiments on defensive distillation (showing effectiveness against non-targeted attacks but failure on targeted ones such as TGSM) to the interpretive statement that targeted attacks leverage the training input gradient. This inference does not reduce to any self-definitional equation, fitted parameter renamed as prediction, or self-citation chain. No equations or prior-author uniqueness theorems are invoked to force the conclusion; the claim remains an observation external to the reported results and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the assumption that the input gradient exploited by targeted attacks is the same signal that backpropagation relies on for training; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The gradient of the loss with respect to the input is the mechanism both for generating targeted adversarial examples and for updating model parameters during training.
Invoked in the abstract's final paragraph to reach the 'impossible tradeoff' conclusion.
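To make the assumption concrete, here is a worked derivation for the simplest case, a single linear layer followed by softmax and cross-entropy. The symbols $W$, $b$, $x$, $z$, $p$, and $y$ are notation introduced for this sketch and are not taken from the paper.

```latex
% Assumed setup: z = W x + b, \quad p = \mathrm{softmax}(z), \quad
% L = -\sum_k y_k \log p_k, with y a one-hot label vector.
\begin{align}
  \frac{\partial L}{\partial z} &= p - y
    && \text{(shared error term)} \\
  \frac{\partial L}{\partial W} &= (p - y)\, x^{\top}
    && \text{(used to update the weights during training)} \\
  \frac{\partial L}{\partial x} &= W^{\top} (p - y)
    && \text{(used to craft input perturbations)}
\end{align}
```

In this setting the weight gradient and the input gradient are distinct quantities, but both are products of the same error term $p - y$. Whether that shared term makes blocking one equivalent to blocking the other is the step the ledger records as an assumption.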

pith-pipeline@v0.9.0 · 5789 in / 1298 out tokens · 34901 ms · 2026-05-24T23:08:39.259712+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: targeted attacks leverage the same input gradient that allows a network to be trained... blocking them will require losing the network's ability to learn

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: defensive distillation is highly effective against non-targeted attacks but is unsuitable for targeted attacks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 9 internal anchors

[1] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199
[2] Papernot, N., et al. (2016, March). The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on (pp. 372-387). IEEE
[3] Goodfellow, I., et al. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572
[4] Kurakin, A., et al. (2016). Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236
[5] Carlini, N., & Wagner, D. (2016b). Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644
[6] Papernot, N., et al. (2015). Distillation as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1511.04508
[7] Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533
[8] Ross, A. S., & Doshi-Velez, F. (2018, April). Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-Second AAAI Conference on Artificial Intelligence
[9] LeCun, Y., Cortes, C., & Burges, C. J. (2010). MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2
[10] Dunne, R. A., & Campbell, N. A. (1997). On the pairing of the Softmax activation and cross-entropy penalty functions and the derivation of the Softmax activation function. In Proc. 8th Aust. Conf. on the Neural Networks, Melbourne (Vol. 181, p. 185)
[11] Akhtar, N., & Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553
[12] Moosavi-Dezfooli, S. M., Fawzi, A., Fawzi, O., & Frossard, P. (2017). Universal adversarial perturbations. arXiv preprint
[13] Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Vol. 1, No. 4, p. 7). Technical report, University of Toronto
[14] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017, April). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506-519). ACM
[15] Keras MNIST CNN Tutorial. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py (accessed June 2019)
[16] Carlini, N., & Wagner, D. (2017, November). Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (pp. 3-14). ACM
[17] Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420
[18] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
[19] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017, April). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506-519). ACM
[20] Classification and Loss Evaluation. https://deepnotes.io/softmax-crossentropy (accessed June 2019)

Appendix A – Derivation of the Cross Entropy Loss and Softmax Activation Function. This section describes the gradient derivation of the cross-entropy loss combined with the softmax activation function. This derivation was first formally described in [10]....
