Fast Adversarial Attacks with Gradient Prediction

Aleksandr V. Petrov; Kamil Ciosek; Konstantina Palla; Nicolò Felicioni

arxiv: 2605.14868 · v1 · pith:LAZE2RAYnew · submitted 2026-05-14 · 💻 cs.LG

Pith reviewed 2026-06-30 21:15 UTC · model grok-4.3

classification 💻 cs.LG

keywords adversarial attacks · gradient prediction · linear regression · neural tangent kernel · FGSM · throughput · robustness evaluation · adversarial training

The pith

Linear regression on forward hidden states predicts input gradients accurately enough to replace the backward pass in fast adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that adversarial examples can be generated without a backward pass by using a linear model to predict the input gradient from the hidden states observed during the forward pass. The approach is exact in the infinite-width Neural Tangent Kernel (NTK) limit, and the authors report that it remains effective for practical finite-width networks. If correct, it allows generating adversarial examples at much higher throughput than standard methods such as FGSM (the abstract reports a 532% increase) while retaining most of their effectiveness. This matters because generating many adversarial examples is a bottleneck in evaluating and training robust models. The result suggests a general way to speed up gradient-based attacks under wall-clock constraints.

Core claim

The central discovery is that a lightweight linear regressor fitted to map forward-pass hidden states to input gradients can generate adversarial perturbations that recover much of FGSM's attack performance, yielding a 532% increase in throughput by eliminating the need for a backward pass. The method is motivated by a kernel view of neural networks and is exact in the NTK regime.

What carries the argument

linear regression fitted on forward-pass hidden states to predict input gradients
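To make the mechanism concrete, here is a minimal sketch of the idea on a toy model. Everything in it is an assumption made for illustration: the one-hidden-layer tanh network, the ridge penalty, the dimensions, and the function names are not taken from the paper, whose experiments appear to target language models. The sketch pays for exact gradients once to build a fitting set, fits a linear map from hidden states to input gradients, and then runs an FGSM step using only a forward pass plus one matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_train = 16, 32, 512   # input dim, hidden width, fitting-set size (illustrative)

# Toy stand-in model: loss(x) = v . tanh(W x). Not the paper's architecture.
W = rng.normal(size=(k, d)) / np.sqrt(d)
v = rng.normal(size=k) / np.sqrt(k)

def forward(X):
    """Forward pass only: hidden states h = tanh(X W^T), shape (n, k)."""
    return np.tanh(X @ W.T)

def true_input_grad(X):
    """Exact input gradient of the loss (what a backward pass returns), shape (n, d)."""
    H = forward(X)
    return ((1.0 - H**2) * v) @ W

def fit_gradient_predictor(H, G, ridge=1e-3):
    """Ridge regression from hidden states (plus a bias column) to input gradients."""
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    A = Hb.T @ Hb + ridge * np.eye(Hb.shape[1])
    return np.linalg.solve(A, Hb.T @ G)   # shape (k + 1, d)

def predict_grad(H, coef):
    """Predicted input gradient: one matrix multiply on the hidden states."""
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    return Hb @ coef

def fgsm_step(X, grad, eps):
    """Standard FGSM update; grad may be exact or predicted."""
    return X + eps * np.sign(grad)

# Offline: backward passes are paid for once, to build the fitting set.
X_train = rng.normal(size=(n_train, d))
coef = fit_gradient_predictor(forward(X_train), true_input_grad(X_train))

# Online: forward pass and a linear read-out, no backward pass.
X_test = rng.normal(size=(64, d))
G_pred = predict_grad(forward(X_test), coef)
X_adv = fgsm_step(X_test, G_pred, eps=0.1)

# Compare against the exact gradient on held-out inputs.
G_true = true_input_grad(X_test)
cos = np.sum(G_pred * G_true, axis=1) / (
    np.linalg.norm(G_pred, axis=1) * np.linalg.norm(G_true, axis=1))
print("mean cosine similarity to the exact gradient:", cos.mean())
```

On this toy model the gradient is not exactly linear in the hidden states, so the predictor is only an approximation. How good that approximation is on real networks is the empirical question the paper has to answer.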

If this is right

Adversarial examples can be generated at significantly higher throughput for robustness evaluation and red-teaming.
Adversarial training can use more examples within the same compute budget.
The approach extends to other gradient-based attacks that rely on input gradients.
Practical finite-width networks still benefit even outside the exact NTK regime.
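The extension to other gradient-based attacks can be sketched with an iterative attack whose gradient source is pluggable. The block below is a generic L-infinity PGD loop written for this note, not the paper's PGD variant. The `predicted_grad` function, its weights, and all hyperparameters are hypothetical stand-ins. The point is only that the attack loop does not care whether `grad_fn` runs a backward pass or a linear read-out of forward-pass hidden states.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, alpha, steps):
    """L-infinity PGD (ascent on the loss) with a pluggable input-gradient function."""
    x0 = np.asarray(x, dtype=float)
    x_adv = x0.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)                            # exact or predicted input gradient
        x_adv = x_adv + alpha * np.sign(g)            # signed ascent step
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)    # project back onto the eps-ball
    return x_adv

# Hypothetical predictor: forward pass to a hidden state, then a linear read-out.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))   # stand-in forward-pass weights (illustrative)
coef = rng.normal(size=(4, 3))       # stand-in regression coefficients (illustrative)

def predicted_grad(x):
    h = np.tanh(W_hidden @ x)        # forward pass only
    return h @ coef                  # predicted input gradient, shape (3,)

x = rng.normal(size=3)
x_adv = pgd_linf(x, predicted_grad, eps=0.3, alpha=0.1, steps=10)
print("max perturbation:", np.max(np.abs(x_adv - x)))
```

One caveat follows from the design: an iterative attack queries the predictor at perturbed points, so prediction errors can compound across steps in a way they cannot in a single FGSM step.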

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar prediction techniques might accelerate other methods that require input gradients, such as certain optimization or interpretability tools.
If the linear predictor generalizes across architectures, it could reduce the cost of large-scale adversarial robustness benchmarks.
Testing the method on very deep or recurrent networks would reveal how far the NTK-motivated approximation holds.
The throughput gains could enable real-time adversarial example generation in some applications.

Load-bearing premise

A linear regressor fitted on forward-pass hidden states can produce input gradients sufficiently accurate to generate effective adversarial perturbations on finite-width networks outside the exact NTK regime.

What would settle it

Measuring attack success rate on a standard image classification model and finding that the predicted-gradient attacks achieve substantially lower success rates than FGSM at the same perturbation budget.

Figures

Figures reproduced from arXiv: 2605.14868 by Aleksandr V. Petrov, Kamil Ciosek, Konstantina Palla, Nicolò Felicioni.

**Figure 2.** PGD scaling on HarmBench. Curves: PGD with exact gradient, PGD-GPR. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

**Figure 3.** FGSM with mixed gradient (see section 5.3). [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]

**Figure 4.** Sensitivity to perturbation budget ϵ and predictor-training augmentation weight decay. [Plot text spilled in from the early-exit figure: panel (a) predictor cosine similarity and FGSM-GPR ASR (%) vs. target layer ℓ (early exit); a further panel with speedup (×) vs. target layer; models Qwen3-4B, Qwen2.5-3B, Llama-2-7B.]

**Figure 5.** Effect of the early-exit layer ℓ. Layer choice: the early-exit layer ℓ trades depth (richer representation) against speed. Figure 5a shows predictor cosine similarity and resulting ASR as a function of ℓ, and Figure 5b shows the corresponding speedup over full backpropagation. Cosine similarity is not a perfect proxy for ASR […] [PITH_FULL_IMAGE:figures/full_fig_p014_5.png]

**Figure 6.** Cosine similarity of predicted gradients vs. resulting ASR on Llama-2-7B. The relationship […] [PITH_FULL_IMAGE:figures/full_fig_p015_6.png]

**Figure 7.** Per-method successful-attack throughput on HarmBench. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png]

Original abstract

Generating adversarial examples at scale is a core primitive for robustness evaluation, adversarial training, and red-teaming, yet even "fast" attacks such as FGSM remain throughput-limited by the cost of a backward pass. We introduce a family of attacks that eliminates the backward pass by predicting the input gradient from forward-pass hidden states via a lightweight linear regression. The approach is motivated by a kernel view of neural networks and is exact in the Neural Tangent Kernel regime, while remaining effective for practical finite-width models. Empirically, our methods recover much of FGSM's attack performance while using only a small fraction of the time, corresponding to a $532\%$ increase in throughput. These results suggest gradient prediction as a simple and general route to significantly faster adversarial generation under realistic wall-clock constraints.
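The headline number is easy to misread, so a short worked conversion may help. The sketch below assumes that "a 532% increase in throughput" means throughput relative to the FGSM baseline at equal work per example; the function names are ours. Note that one figure caption refers to successful-attack throughput, so the paper's own accounting may differ from this simple reading.

```python
def throughput_multiplier(percent_increase):
    """A p% increase means the new throughput is (1 + p/100) times the baseline."""
    return 1.0 + percent_increase / 100.0

def time_fraction(percent_increase):
    """Time per example relative to the baseline, assuming equal work per example."""
    return 1.0 / throughput_multiplier(percent_increase)

mult = throughput_multiplier(532.0)
frac = time_fraction(532.0)
print(f"throughput multiplier: {mult:.2f}x")
print(f"time per example: {frac:.1%} of baseline")
```

Under that reading, a 532% increase is a 6.32x throughput multiplier, or roughly 16% of the baseline time per example, which is consistent with the abstract's phrase "a small fraction of the time".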

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper suggests predicting input gradients via linear regression on hidden states to skip backprop in attacks, but supplies no experiments or error analysis to support the claims.

The main takeaway is that this work proposes skipping the backward pass in FGSM by fitting a linear regressor to predict input gradients from hidden states, claiming a large throughput boost while keeping attack strength close to the original.

What is new is the concrete application of a linear model on forward activations for this purpose, building on NTK ideas but targeting the attack primitive directly. The abstract presents it cleanly as exact in the infinite-width case and practical otherwise.

The paper does well in identifying a real bottleneck in adversarial generation and suggesting a simple workaround. The motivation from kernel views is straightforward.

The soft spots center on the missing evidence. The abstract mentions empirical recovery of FGSM performance and the 532% gain but gives no experimental details, no description of how or on what data the regressor was trained, and no analysis of prediction error. Without those, it is impossible to tell if the approximation holds for the finite networks that matter in practice.

The stress-test concern holds up: the linear fit quality is not shown to be sufficient outside the NTK limit, so the generated examples might not be effective enough to justify the method. This is a load-bearing assumption with no support provided.

This paper is for people thinking about efficient ways to run many attacks, such as in large robustness studies. A reader would get value only if the full version includes the missing validation. It does not deserve a serious referee in its current state.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces a family of adversarial attacks that replace the backward pass of FGSM with a linear regression predictor of the input gradient from forward-pass hidden states. The construction is motivated by a kernel view of neural networks and is exact in the infinite-width NTK regime; the authors claim it remains effective for practical finite-width models, recovering much of FGSM attack success while delivering a 532% throughput increase.

Significance. If the empirical claims are substantiated, the work would provide a practical route to higher-throughput adversarial example generation, which is relevant for robustness evaluation and adversarial training pipelines. The NTK-derived motivation supplies a clear theoretical starting point, though the finite-width extension rests entirely on empirical fit quality.

major comments (2)

[Abstract] The central claim that the methods 'recover much of FGSM's attack performance' while achieving a '532% increase in throughput' is stated without any description of experimental protocol, regressor training procedure, validation set, baselines, number of runs, or statistical tests. Soundness of the throughput and attack-success claims cannot be assessed from the supplied text.
[Abstract / method description] The linear regressor is fitted on hidden-state/gradient pairs drawn from data, yet the paper presents the output as a 'prediction' whose accuracy for finite-width networks is asserted without analytic bounds or error analysis outside the exact NTK limit. If fit quality degrades on deeper architectures or shifted distributions, the generated perturbations may lose effectiveness, directly undermining the claimed throughput advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating where revisions will be made to improve clarity and completeness.

Referee: [Abstract] The central claim that the methods 'recover much of FGSM's attack performance' while achieving a '532% increase in throughput' is stated without any description of experimental protocol, regressor training procedure, validation set, baselines, number of runs, or statistical tests. Soundness of the throughput and attack-success claims cannot be assessed from the supplied text.

Authors: We agree that the abstract's brevity prevents inclusion of full experimental details. The complete manuscript describes the regressor training on hidden-state/gradient pairs from a held-out validation set, the FGSM baseline, multiple runs with statistical reporting, and throughput measurements under controlled conditions. In the revision we will expand the abstract with a concise reference to the empirical protocol and add a dedicated experimental setup subsection early in the paper to make these elements immediately accessible.
Revision: yes

Referee: [Abstract / method description] The linear regressor is fitted on hidden-state/gradient pairs drawn from data, yet the paper presents the output as a 'prediction' whose accuracy for finite-width networks is asserted without analytic bounds or error analysis outside the exact NTK limit. If fit quality degrades on deeper architectures or shifted distributions, the generated perturbations may lose effectiveness, directly undermining the claimed throughput advantage.

Authors: The construction is exact only in the infinite-width NTK limit; for finite-width networks the method is presented as an empirical approximation whose utility is validated through attack-success and throughput measurements. We do not supply analytic error bounds outside the NTK regime, as obtaining such bounds for practical finite networks is an open theoretical question not addressed in this work. The manuscript instead reports consistent recovery of FGSM performance across the evaluated models and data distributions. We will add quantitative fit-quality metrics (e.g., gradient prediction MSE on held-out data) to the revised version to better characterize when the approximation remains reliable.
Revision: partial
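The fit-quality metrics promised in the rebuttal could look like the sketch below. The held-out MSE comes from the rebuttal text and cosine similarity from the figure captions. The per-coordinate sign agreement is our own addition, included because an FGSM step uses only the sign of the gradient. The function name and the example arrays are illustrative assumptions.

```python
import numpy as np

def fit_quality(G_true, G_pred, tiny=1e-12):
    """Compare predicted and exact input gradients, one example per row."""
    G_true = np.asarray(G_true, dtype=float)
    G_pred = np.asarray(G_pred, dtype=float)
    mse = np.mean((G_pred - G_true) ** 2)
    dots = np.sum(G_pred * G_true, axis=1)
    norms = np.linalg.norm(G_pred, axis=1) * np.linalg.norm(G_true, axis=1)
    cosine = dots / (norms + tiny)               # tiny guards against zero-norm rows
    sign_agreement = np.mean(np.sign(G_pred) == np.sign(G_true))
    return {"mse": mse, "mean_cosine": cosine.mean(), "sign_agreement": sign_agreement}

# Tiny hand-checkable example: the second row's prediction is flipped.
G_true = np.array([[1.0, 0.0], [0.0, 2.0]])
G_pred = np.array([[1.0, 0.0], [0.0, -2.0]])
metrics = fit_quality(G_true, G_pred)
print(metrics)
```

In this example the MSE is 4.0, the mean cosine is approximately 0 (one row aligned, one row opposed), and three of the four coordinates agree in sign. As the Figure 5 caption notes, cosine similarity is not a perfect proxy for attack success, so these metrics complement a direct attack-success measurement and do not replace it.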

Circularity Check

0 steps flagged

No significant circularity; method is explicitly data-dependent regression with empirical validation

The paper presents an explicit algorithmic method: fit a linear regressor on forward-pass hidden states to approximate input gradients, motivated by exactness in the NTK limit. This is not a first-principles derivation whose output is smuggled back as input; the regression coefficients are learned from data as part of the attack construction, and performance claims are supported by direct empirical comparison to FGSM rather than by any self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core steps. The approach is self-contained as a practical approximation technique whose validity rests on measured attack success rates, not on tautological equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a fitted linear regressor and on the assumption that NTK-style gradient prediction transfers to finite networks; both introduce free parameters and domain assumptions not supplied by upstream literature.

free parameters (1)

linear regression coefficients
Weights of the lightweight linear model that maps hidden states to predicted input gradients; fitted to data.

axioms (1)

Domain assumption: gradient prediction from hidden states is exact in the Neural Tangent Kernel regime.
Invoked to motivate the approach for practical finite-width models.
