Defending Adversarial Attacks by Correcting logits
Pith reviewed 2026-05-25 15:26 UTC · model grok-4.3
The pith
A two-layer network can correct logits to recover accurate predictions under adversarial attack, without using image data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A two-layer network trained on mixed clean and perturbed logits learns to recover the original class scores, thereby defending against a wide range of adversarial attacks by correcting the logits before the final prediction step.
What carries the argument
A two-layer network that takes logits as input and outputs corrected logits to restore the original prediction.
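To make the shape of this mechanism concrete, here is a minimal sketch of a logit corrector in plain Python. It is not the authors' implementation: the hidden width, the ReLU activation, the random initialisation, and the cross-entropy objective against the original label are all assumptions chosen for illustration, and the names `make_corrector`, `correct_logits`, `predict`, and `cross_entropy` are hypothetical.

```python
import math
import random


def make_corrector(num_classes, hidden, seed=0):
    """Randomly initialise a two-layer MLP: logits -> hidden (ReLU) -> logits.

    The hidden width and the initialisation scale are illustrative assumptions.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(num_classes)
    w1 = [[rng.uniform(-scale, scale) for _ in range(num_classes)]
          for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-scale, scale) for _ in range(hidden)]
          for _ in range(num_classes)]
    b2 = [0.0] * num_classes
    return w1, b1, w2, b2


def correct_logits(params, logits):
    """Map (possibly perturbed) logits to corrected logits of the same length."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, sum(w * x for w, x in zip(row, logits)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def predict(logits):
    """Return the index of the largest class score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def cross_entropy(logits, label):
    """Softmax cross-entropy of one logit vector against one label.

    Assumed training objective: push the corrected logits back towards
    the original label.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]
```

In this sketch the corrector never sees pixels: its only input is the classifier's logit vector, which is what would let such a defender run when image data are unavailable. Training would minimise `cross_entropy(correct_logits(params, x), y)` over a mix of clean and perturbed logit vectors `x` with original labels `y`; the optimisation loop is omitted here.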
If this is right
- The defender maintains relatively high accuracy against a wide range of adversarial attacks.
- Performance transfers to attackers that share similar properties.
- Defense succeeds in scenarios where image data are unavailable.
- The approach yields high interpretability, especially at the semantic level.
Where Pith is reading between the lines
- Adversarial effects on logits may follow learnable, structured patterns rather than pure noise.
- Logit-only correction could extend to black-box settings where only model outputs are accessible.
- Semantic-level interpretability might allow targeted debugging of which class scores are most vulnerable.
Load-bearing premise
Patterns observed in logits from a training mix of clean and attacked examples are enough for the two-layer network to correct logits produced by new attacks on unseen data.
What would settle it
Testing the trained two-layer corrector on a new attack type absent from the training mixture and measuring whether defense accuracy drops sharply.
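The test described above can be sketched as a small bookkeeping routine. This is an illustrative protocol, not one taken from the paper: the `results` layout, the attack names used as keys, and the helper names `accuracy` and `held_out_gap` are all hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the original labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def held_out_gap(results, seen_attacks, held_out_attack):
    """Compare defense accuracy on training-mix attacks with a held-out attack.

    `results` maps an attack name to a (predictions, labels) pair, where the
    predictions come from the corrected logits. Returns
    (mean accuracy on seen attacks, accuracy on held-out attack, gap).
    """
    seen_acc = [accuracy(*results[name]) for name in seen_attacks]
    mean_seen = sum(seen_acc) / len(seen_acc)
    held_acc = accuracy(*results[held_out_attack])
    return mean_seen, held_acc, mean_seen - held_acc
```

A large positive gap would suggest the corrector fits attack-specific patterns; a small gap would be consistent with shared structure in the logit perturbations, though it would not prove it.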
Original abstract
Generating and eliminating adversarial examples has been an intriguing topic in the field of deep learning. While previous research verified that adversarial attacks are often fragile and can be defended via image-level processing, it remains unclear how high-level features are perturbed by such attacks. We investigate this issue from a new perspective, which purely relies on logits, the class scores before softmax, to detect and defend adversarial attacks. Our defender is a two-layer network trained on a mixed set of clean and perturbed logits, with the goal being recovering the original prediction. Upon a wide range of adversarial attacks, our simple approach shows promising results with relatively high accuracy in defense, and the defender can transfer across attackers with similar properties. More importantly, our defender can work in the scenarios that image data are unavailable, and enjoys high interpretability especially at the semantic level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a defense against adversarial attacks that operates purely on logits (pre-softmax class scores) rather than images. A two-layer network is trained on a mixture of clean and adversarially perturbed logits with the objective of recovering the original (correct) prediction. The abstract asserts that this yields promising defense accuracy, transfers across attackers with similar properties, functions when image data are unavailable, and provides high semantic-level interpretability.
Significance. If the empirical claims hold and the two-layer corrector generalizes beyond the training attacks, the approach would be notable for enabling defense without pixel-level access or model gradients and for offering a degree of interpretability at the logit level. This would distinguish it from most image-processing or adversarial-training defenses in the literature.
Major comments (2)
- [Abstract] The central claims that 'the defender can transfer across attackers with similar properties' and that the approach shows 'promising results with relatively high accuracy' are not supported by any named datasets, attack methods, accuracy numbers, baselines, or experimental protocol. Without these, it is impossible to determine whether the two-layer network recovers predictions via shared logit structure or merely fits attack-specific patterns.
- [Abstract] The generalization argument rests on the unstated assumption that logit perturbations induced by different attacks share consistent, learnable structure across examples and attackers. No analysis, ablation, or visualization of logit deltas is supplied to test whether the correction is universal rather than attack-dependent; if the deltas are largely attack-specific, the trained corrector will fail on held-out attacks and new data.
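One simple form the requested logit-delta analysis could take is sketched below. It is an assumption about how such a check might be run, not something described in the paper: the helpers `mean_delta` and `cosine_similarity` are hypothetical, and comparing per-attack mean deltas by cosine similarity is only a coarse first look at whether perturbations point in similar directions.

```python
import math


def mean_delta(clean_logits, attacked_logits):
    """Average per-class shift (attacked minus clean) over paired examples."""
    if not clean_logits or len(clean_logits) != len(attacked_logits):
        raise ValueError("need equally sized, non-empty lists of logit vectors")
    num_classes = len(clean_logits[0])
    sums = [0.0] * num_classes
    for clean, attacked in zip(clean_logits, attacked_logits):
        for i in range(num_classes):
            sums[i] += attacked[i] - clean[i]
    return [s / len(clean_logits) for s in sums]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Computing `mean_delta` once per attack on the same clean examples and then comparing the resulting vectors pairwise with `cosine_similarity` would give a rough picture of which attackers shift logits in similar directions. Because averaging hides per-example variation, a per-class or per-example breakdown would be needed before drawing conclusions.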
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments on the abstract below, clarifying the support provided in the full manuscript while agreeing to strengthen the abstract for clarity.
Point-by-point responses
- Referee: [Abstract] The central claims that 'the defender can transfer across attackers with similar properties' and that the approach shows 'promising results with relatively high accuracy' are not supported by any named datasets, attack methods, accuracy numbers, baselines, or experimental protocol. Without these, it is impossible to determine whether the two-layer network recovers predictions via shared logit structure or merely fits attack-specific patterns.
  Authors: The abstract serves as a high-level summary. The full manuscript provides the requested details in the Experiments section, including datasets (CIFAR-10, MNIST), attack methods (FGSM, PGD, CW, and others), specific accuracy numbers for defense performance, baseline comparisons, and the cross-attacker transfer protocol. These experiments demonstrate recovery via logit correction rather than attack-specific fitting. We will revise the abstract to reference these elements and include representative accuracy figures.
  Revision: yes
- Referee: [Abstract] The generalization argument rests on the unstated assumption that logit perturbations induced by different attacks share consistent, learnable structure across examples and attackers. No analysis, ablation, or visualization of logit deltas is supplied to test whether the correction is universal rather than attack-dependent; if the deltas are largely attack-specific, the trained corrector will fail on held-out attacks and new data.
  Authors: The manuscript supports the generalization claim through explicit transfer experiments across attackers sharing similar properties (detailed in the results), which empirically indicate learnable shared structure in logit perturbations. We agree that direct analysis of logit deltas would provide additional substantiation and will add visualizations and ablations of these deltas in the revision.
  Revision: partial
Circularity Check
No circularity: empirical training of logit corrector is self-contained
The paper describes an empirical method: a two-layer network is trained on a mixture of clean and attacked logits to recover original predictions, with reported transfer across similar attackers. No derivation chain, first-principles result, or prediction is claimed that reduces by the paper's own equations or self-citations to its inputs. Performance assertions rest on experimental outcomes rather than tautological fits or load-bearing self-references. The central assumption about shared logit perturbation structure is an empirical hypothesis, not a definitional equivalence.
Reference graph
Works this paper leans on
- [1] Dana Angluin and Philip D. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988
- [2] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018
- [3] Jacob Buckman, Aurko Roy, Colin A. Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In ICLR, 2018
- [4] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), 2017
- [5] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In CVPR, 2018
- [6] Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel M. Roy. A study of the effect of jpg compression on adversarial images. CoRR, abs/1608.00853, 2016
- [7] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017
- [8] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015
- [9] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. In ICLR, 2018
- [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016
- [11] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012
- [12] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017
- [13] Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. In NeurIPS, 2018
- [14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015
- [15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012
- [16] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR Workshop, 2017
- [17] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017
- [18] Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Jun Zhu, and Xiaolin Hu. Defense against adversarial attacks using high-level representation guided denoiser. In CVPR, 2018
- [19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018
- [20] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In CVPR, 2016
- [21] Aran Nayebi and Surya Ganguli. Biologically inspired protection of deep networks from adversarial attacks. CoRR, abs/1703.09202, 2017
- [22] Margarita Osadchy, Julio Hernandez-Castro, Stuart J. Gibson, Orr Dunkelman, and Daniel Pérez-Cabo. No bot expects the deepcaptcha! introducing immutable adversarial examples, with applications to captcha generation. In IEEE Transactions on Information Forensics and Security, volume 12, pages 2640–2653, 2017
- [23] Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas ... Technical Report on the CleverHans v2.1.0 Adversarial Examples Library
- [24] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (SP), 2016
- [25] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017
- [26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NeurIPS Workshop, 2017
- [27] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017
- [28] Aaditya Prakash, Nick Moran, Solomon Garber, Antonella DiLillo, and James A. Storer. Deflecting adversarial attacks with pixel deflection. In CVPR, 2018
- [29] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. CoRR, abs/1902.04818, 2019
- [30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. In IJCV, volume 115, pages 211–252, 2015
- [31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015
- [32] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014
- [33] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian J. Goodfellow, Dan Boneh, and Patrick D. McDaniel. Ensemble adversarial training: Attacks and defenses. In ICLR, 2018
- [34] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan L. Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, 2017
- [35] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Loddon Yuille. Mitigating adversarial effects through randomization. In ICLR, 2018
- [36] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Loddon Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. CoRR, abs/1812.03411, 2018
A Supporting classes of different attacks
In Figure 3, we illustrate the supporting classes of defending PGD [19], MIM [5], DeepFool [20] and C&W [4] on ResNet-50 [10], respectively. Just...