
arxiv: 2605.02701 · v1 · submitted 2026-05-04 · 🧮 math.OC · cs.LG · stat.ML

Robust and Fast Training via Per-Sample Clipping

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · stat.ML
keywords per-sample clipping · stochastic gradient descent · heavy-tailed noise · non-convex optimization · convergence rates · robust training · gradient clipping · deep neural networks

The pith

Per-sample gradient clipping in SGD achieves optimal convergence rates for non-convex problems under heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops per-sample clipped SGD (PS-Clip-SGD), a stochastic gradient method that clips the gradient contribution of each individual training sample before averaging within a mini-batch. It establishes that this estimator attains optimal in-expectation convergence rates for non-convex optimization when the gradient noise is heavy-tailed, and supplies high-probability bounds that match those rates up to polylogarithmic factors in the failure probability. These guarantees matter because heavy-tailed gradient noise appears routinely in deep-network training, where standard SGD or batch-level clipping can converge more slowly or less reliably. The authors also report empirical gains over momentum SGD and conventional clipping when training AlexNet on CIFAR-100, even after accounting for the extra per-sample cost, and observe that clipping at the mini-batch level during gradient accumulation can improve performance at virtually no additional computational cost.

Core claim

By replacing the usual averaged gradient with a per-sample clipped version, the resulting PS-Clip-SGD algorithm attains optimal in-expectation convergence rates for non-convex stochastic optimization under heavy-tailed gradient noise and yields high-probability convergence guarantees that match those rates up to polylogarithmic factors in the failure probability.

What carries the argument

Per-sample clipped gradient estimator, which clips each sample's gradient individually before aggregation to control the influence of heavy-tailed outliers.
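
The difference between per-sample and batch-level clipping can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes L2-norm clipping with a single threshold `tau`, and the helper names (`clip`, `per_sample_clipped_mean`, `batch_clipped_mean`) and the toy gradients are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def clip(v: Sequence[float], tau: float) -> Vector:
    """Rescale v so that its L2 norm is at most tau (assumed tau > 0)."""
    norm = l2_norm(v)
    scale = 1.0 if norm <= tau else tau / norm
    return [scale * x for x in v]


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def per_sample_clipped_mean(grads: Sequence[Sequence[float]], tau: float) -> Tuple[Vector, int]:
    """Clip every sample gradient first, then average. Also count clipped samples."""
    n_clipped = sum(1 for g in grads if l2_norm(g) > tau)
    return mean([clip(g, tau) for g in grads]), n_clipped


def batch_clipped_mean(grads: Sequence[Sequence[float]], tau: float) -> Vector:
    """Conventional clipping: average first, then clip the averaged gradient."""
    return clip(mean(grads), tau)


# Toy mini-batch (made-up numbers): two ordinary gradients and one outlier.
grads = [[0.3, 0.4], [0.0, -0.5], [300.0, 400.0]]
ps_estimate, n_clipped = per_sample_clipped_mean(grads, tau=1.0)
batch_estimate = batch_clipped_mean(grads, tau=1.0)
print("per-sample:", ps_estimate, "clipped samples:", n_clipped)
print("batch-level:", batch_estimate)
```

In this toy batch the outlier is rescaled before it enters the average, so the two ordinary samples still shape the per-sample estimate. With batch-level clipping the averaged direction is almost entirely the outlier's, and clipping only shortens it.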

If this is right

  • Optimal in-expectation convergence rates are obtained for non-convex problems under the stated heavy-tailed noise model.
  • High-probability convergence bounds hold that differ from the expectation bounds by only polylog factors in the failure probability.
  • Empirical training of AlexNet on CIFAR-100 improves over both momentum SGD and batch-level clipping, even after accounting for per-sample overhead.
  • Applying clipping at the mini-batch level during gradient accumulation can improve performance at almost zero extra cost, contrary to the usual practice of clipping only once all accumulation steps are complete.
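
The last point, about when to clip during gradient accumulation, can be sketched as follows. This is a minimal illustration under stated assumptions: L2-norm clipping with one threshold `tau`, plain averaging over accumulation steps, and a hypothetical `accumulate` helper with made-up micro-batch gradients. The paper's exact scheme may differ.

```python
import math
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def clip(v: Sequence[float], tau: float) -> List[float]:
    """Rescale v so that its L2 norm is at most tau (assumed tau > 0)."""
    norm = l2_norm(v)
    scale = 1.0 if norm <= tau else tau / norm
    return [scale * x for x in v]


def accumulate(micro_grads: Sequence[Sequence[float]], tau: float,
               clip_each_step: bool) -> List[float]:
    """Average micro-batch gradients over the accumulation steps.

    clip_each_step=True  clips every micro-batch gradient before adding it.
    clip_each_step=False accumulates first and clips the average once.
    """
    total = [0.0] * len(micro_grads[0])
    for g in micro_grads:
        contribution = clip(g, tau) if clip_each_step else list(g)
        total = [t + c for t, c in zip(total, contribution)]
    average = [t / len(micro_grads) for t in total]
    return average if clip_each_step else clip(average, tau)


# Four accumulation steps (made-up numbers); the last micro-batch is an outlier.
micro_grads = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-80.0, 0.0]]
during = accumulate(micro_grads, tau=1.0, clip_each_step=True)
after = accumulate(micro_grads, tau=1.0, clip_each_step=False)
print("clip during accumulation:", during)
print("clip after accumulation: ", after)
```

Clipping at each step costs one extra norm computation per micro-batch on a gradient that is already in memory, which is consistent with the near-zero overhead the paper reports.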

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may allow larger batch sizes in practice without sacrificing stability when tails are heavy.
  • It suggests that the timing of clipping relative to accumulation steps deserves systematic study across optimizers.
  • If the per-sample clipping cost can be amortized, the approach could extend naturally to other first-order methods that suffer from outlier gradients.

Load-bearing premise

The noise in the observed gradients must follow a heavy-tailed distribution possessing finite moments of a prescribed order.
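
Written as a formula, and assuming the form quoted in the simulated rebuttal further down (the paper's exact statement may differ), the premise is a bounded p-th moment condition. The symbol $F$ for the expected objective is notation introduced here for illustration.

```latex
\mathbb{E}_{\xi}\!\left[\,\bigl\lVert \nabla f(x,\xi) - \nabla F(x) \bigr\rVert^{p}\,\right] \le \sigma^{p}
\qquad \text{for some } p \in (1,2].
```

For $p = 2$ this is the familiar bounded-variance assumption. For $p < 2$ the variance may be infinite, which is the regime where an unclipped mini-batch average can be dominated by a single sample.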

What would settle it

The claims would be called into question if, on a synthetic non-convex problem whose gradient-noise tail behaviour is independently verified, PS-Clip-SGD failed to match the claimed optimal rates and performed no better than unclipped SGD.
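
A test of this kind needs noise whose tail behaviour is known by construction. The harness below is a minimal sketch and not the paper's experiment. The one-dimensional objective F(x) = x^2 / (1 + x^2), the symmetric shifted-Pareto noise, the step size, batch size, and threshold are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, Optional


def grad_F(x: float) -> float:
    """Gradient of the hypothetical non-convex objective F(x) = x^2 / (1 + x^2)."""
    d = 1.0 + x * x
    return 2.0 * x / (d * d)


def gaussian_noise(rng: random.Random) -> float:
    """Light-tailed reference noise: all moments are finite."""
    return rng.gauss(0.0, 1.0)


def pareto_noise(rng: random.Random, alpha: float = 1.5) -> float:
    """Symmetric heavy-tailed noise: the p-th moment is finite only for p < alpha."""
    magnitude = rng.paretovariate(alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude


def clip_scalar(g: float, tau: float) -> float:
    return max(-tau, min(tau, g))


def run(noise: Callable[[random.Random], float], tau: Optional[float],
        steps: int = 100, batch: int = 8, lr: float = 0.1, seed: int = 0) -> float:
    """Return the average |F'(x_t)| along one run; tau=None disables clipping."""
    rng = random.Random(seed)
    x, total = 1.0, 0.0
    for _ in range(steps):
        total += abs(grad_F(x))
        samples = [grad_F(x) + noise(rng) for _ in range(batch)]
        if tau is not None:
            # Per-sample clipping: each sample is clipped before averaging.
            samples = [clip_scalar(g, tau) for g in samples]
        x -= lr * sum(samples) / batch
    return total / steps


for name, noise in (("gaussian", gaussian_noise), ("pareto", pareto_noise)):
    for tau in (None, 1.0):
        print(f"{name:8s} tau={tau}: avg |grad| = {run(noise, tau):.4f}")
```

A single seed says little. A meaningful comparison would repeat each configuration over many seeds and tune the step size and threshold per method.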

Figures

Figures reproduced from arXiv: 2605.02701 by Davide Nobile, Philipp Grohs.

Figure 1: Performance of Normalized-SGD, Clip-SGD and PS-Clip-SGD for different noise regimes.

Figure 2: (1 − δ)-quantile of the average gradient norm after T = 100 training steps, plotted against log(1/δ) for the three algorithms and different noise regimes. As before, due to the choice of parameters, Normalized SGD and Clip-SGD are indistinguishable in the plot.

Figure 3: Training and validation accuracies of SGD, Clip-SGD and PS-Clip-SGD, all with momen…

Figure 4: Performance of Normalized-SGD, Clip-SGD and PS-Clip-SGD for different noise regimes.

Figure 5: (1 − δ)-quantile of the average gradient norm after T = 100 training steps, plotted against log(1/δ) for the three algorithms and different noise regimes. The experiment is performed using the tuned hyperparameters from…

Figure 6: Blue line, left y-axis: average per-sample gradient norm in each epoch, $\frac{1}{n_{\text{batches}}} \sum_{t=1}^{n_{\text{batches}}} \frac{1}{\text{batch\_size}} \sum_{i=1}^{\text{batch\_size}} \lVert \nabla f(x_t, \xi_t^{(i)}) \rVert$. Green line, right y-axis: average number of clipped gradients in a batch for PS-Clip-SGD, $\frac{1}{n_{\text{batches}}} \sum_{t=1}^{n_{\text{batches}}} \frac{1}{\text{batch\_size}} \sum_{i=1}^{\text{batch\_size}} n^{(t,i)}_{\text{clipped}}$. Note: $n_{\text{batches}}$ indicates the total number of batches in an epoch, while $n^{(t,i)}_{\text{clipped}}$ indicates the n…
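
The (1 − δ)-quantile plotted in Figures 2 and 5 can be estimated from repeated runs. The sketch below assumes one average gradient norm per independent run and uses a simple order-statistic estimator. The function name `upper_quantile` and the sample values are hypothetical, and the paper may compute the quantile differently.

```python
import math
from typing import Sequence


def upper_quantile(values: Sequence[float], delta: float) -> float:
    """Smallest observed value v such that at least a (1 - delta) fraction of runs are <= v."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    ordered = sorted(values)
    k = math.ceil((1.0 - delta) * len(ordered))
    return ordered[max(k, 1) - 1]


# Made-up average gradient norms from ten independent runs; one run went badly.
run_averages = [0.12, 0.10, 0.95, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12, 0.11]

for delta in (0.5, 0.05):
    q = upper_quantile(run_averages, delta)
    print(f"log(1/delta) = {math.log(1.0 / delta):.2f}  quantile = {q}")
```

A method with good high-probability behaviour keeps this quantile nearly flat as δ shrinks, whereas occasional bad runs push it up sharply.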
Original abstract

We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation, applying clipping at the mini-batch level can improve training performance while incurring virtually no additional computational cost. This finding is particularly interesting, as it contradicts the common practice of applying clipping only after all accumulation steps have been completed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes per-sample clipped SGD (PS-Clip-SGD) as a robust gradient estimator. It claims that this method achieves optimal in-expectation convergence rates for non-convex optimization under heavy-tailed gradient noise, along with high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. These theoretical results are supported by experiments demonstrating that PS-Clip-SGD outperforms vanilla SGD with momentum and standard gradient clipping when training AlexNet on CIFAR-100 (accounting for extra compute), and that mini-batch-level clipping during gradient accumulation can improve performance at negligible cost, contrary to common practice.

Significance. If the stated convergence results hold, the work provides a theoretically grounded clipping strategy with optimal rates under heavy-tailed noise assumptions that are relevant to deep learning. The matching high-probability bounds and the empirical observation on accumulation-stage clipping are practical strengths. The optimality claims are conditional on the noise model, and the experiments serve as supporting evidence rather than verification of that model.

major comments (1)
  1. [Theoretical analysis] The central optimality claim in the abstract rests on the gradient noise satisfying explicit heavy-tailed moment bounds. The manuscript should include a dedicated subsection (likely in the theoretical analysis) that states the precise moment conditions (e.g., which p-moments are finite) and shows how they yield the claimed optimal rate; without this explicit linkage the applicability to the reported AlexNet/CIFAR-100 runs remains conditional rather than verified.
minor comments (2)
  1. [Abstract] The abstract states that multiple numerical experiments were performed, yet only the AlexNet/CIFAR-100 run is described in detail; a one-sentence summary of the other experiments would improve completeness.
  2. [Experiments] In the experimental section, the comparison to baselines should explicitly state whether hyper-parameters for vanilla SGD and standard clipping were re-tuned on the same compute budget as PS-Clip-SGD; the current description leaves open the possibility that the reported gains partly reflect unequal tuning effort.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the minor revision recommendation. We address the single major comment below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Theoretical analysis] The central optimality claim in the abstract rests on the gradient noise satisfying explicit heavy-tailed moment bounds. The manuscript should include a dedicated subsection (likely in the theoretical analysis) that states the precise moment conditions (e.g., which p-moments are finite) and shows how they yield the claimed optimal rate; without this explicit linkage the applicability to the reported AlexNet/CIFAR-100 runs remains conditional rather than verified.

    Authors: We agree that a dedicated subsection would improve clarity and make the optimality claims self-contained. In the revised manuscript we will insert a new subsection (tentatively titled 'Moment Assumptions and Derivation of Optimal Rates') immediately after the problem setup in the theoretical analysis section. This subsection will (i) state the precise assumption that the stochastic gradient noise satisfies E[||noise||^p] ≤ σ^p for some p ∈ (1,2] and all samples, (ii) recall the standard heavy-tailed convergence result that yields the optimal in-expectation rate O(T^{-(p-1)/(3p-2)}) for the gradient norm (or the specific rate proved in our theorems), and (iii) explicitly connect these conditions to the high-probability bounds. We will also add a short paragraph discussing why the CIFAR-100 experiments are consistent with the assumed regime. These additions require only a few paragraphs and do not alter any proofs or experiments.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes per-sample clipped SGD (PS-Clip-SGD) and claims optimal in-expectation and high-probability convergence rates for non-convex problems under heavy-tailed gradient noise with moment bounds. These rates are derived from standard external optimization theory (e.g., typical SGD analyses adapted to clipping and tail assumptions) rather than reducing to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations within the paper. The abstract and context indicate the results are conditional on the stated noise model, with experiments as separate empirical validation. No self-definitional steps, ansatzes smuggled via citation, or renaming of known results as new derivations are present. The central claims remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard non-convex smoothness assumptions and a heavy-tailed noise model whose precise parameters are not enumerated in the abstract.

axioms (2)
  • Domain assumption: gradient noise is heavy-tailed with bounded moments sufficient for the clipping analysis.
    Invoked to obtain optimal rates; if violated, the rates no longer hold.
  • Standard math: the objective is L-smooth and bounded below.
    Standard assumption in non-convex SGD analysis.



