pith. machine review for the scientific record.

arxiv: 2605.07634 · v1 · submitted 2026-05-08 · 🧮 math.OC · cs.LG · math.ST · stat.TH

Recognition: 2 Lean theorem links

Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:32 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · math.ST · stat.TH
keywords stochastic gradient descent · heavy-tailed noise · robust optimization · non-convex optimization · medoid sampling · mini-batch methods

The pith

Medoid selection on mini-batch gradients yields O(T^{-1}) convergence for non-convex stochastic optimization under heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops R-SGD-Mini, which splits each batch of K samples into M chunks, computes one stochastic gradient per chunk, and takes the medoid gradient for the parameter update. Under symmetric heavy-tailed noise with possibly infinite variance and standard non-convex smoothness, the expected time-averaged squared gradient norm is shown to converge at rate O(T^{-1}) to a noise-dependent neighborhood of zero. With a known time horizon the rate becomes O(T^{-1/2}), and clipping the gradients produces matching high-probability bounds. A reader would care because this provides an alternative robustness mechanism to clipping or normalization that works directly with the batch structure.

Core claim

Under a general class of symmetric heavy-tailed gradient noises and standard non-convex assumptions, R-SGD-Mini establishes explicit bounds showing that the expected time-averaged squared gradient norm converges at rate O(T^{-1}) to a small neighborhood of zero characterized by the noise and parameters; the rate improves to O(T^{-1/2}) when the horizon is known in advance, and clipping yields high-probability convergence at the same rates.

What carries the argument

Medoid mini-batch gradient sampling: the K-sample batch is partitioned into M chunks, a gradient is formed for each chunk, and the update uses the medoid gradient among the M vectors.
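
In symbols (editorial notation following the description above, not the paper's own, and assuming each chunk gradient is the average of the per-sample gradients in that chunk), one iteration reads:

    % One R-SGD-Mini step in editorial notation; B_{t,i} is the i-th chunk of the
    % K-sample batch at iteration t, F(.; s) the per-sample loss, alpha_t the step size.
    g_i(x_t) = \frac{M}{K} \sum_{s \in \mathcal{B}_{t,i}} \nabla F(x_t; s),
        \qquad i = 1, \dots, M, \quad \lvert \mathcal{B}_{t,i} \rvert = K/M,
    g_{\mathrm{med}}(x_t) = \operatorname*{arg\,min}_{g \in \{g_1(x_t), \dots, g_M(x_t)\}}
        \sum_{j=1}^{M} \lVert g - g_j(x_t) \rVert,
    x_{t+1} = x_t - \alpha_t \, g_{\mathrm{med}}(x_t).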

If this is right

  • The method converges without requiring finite variance or higher moments on the stochastic gradients.
  • The neighborhood of convergence is explicitly characterized in terms of noise parameters and algorithm constants.
  • High-probability guarantees are recovered by incorporating gradient clipping.
  • Empirical performance is competitive with or better than SGD, clipped SGD, and median-of-means methods on the tested problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This selection strategy could be tested on other heavy-tailed settings such as distributed optimization with communication noise.
  • Combining the medoid step with adaptive step sizes or momentum might further improve practical rates beyond the proved bounds.
  • The approach suggests a broader class of order-statistic based gradient estimators for robust stochastic optimization.

Load-bearing premise

The stochastic gradient noise is symmetric and heavy-tailed, and the medoid selection step introduces no bias beyond the terms accounted for in the analysis.

What would settle it

Running the algorithm on a non-convex test problem with Cauchy noise and observing that the time-averaged squared gradient norm does not decrease toward the predicted neighborhood at rate O(T^{-1}).
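
A minimal falsification sketch along these lines, in Python, assuming a toy smooth non-convex objective and illustrative values of K, M, and the step size (none taken from the paper):

    # Editorial sketch: medoid mini-batch SGD on a toy smooth non-convex objective
    # with Cauchy (infinite-variance) gradient noise; all constants are illustrative.
    import numpy as np

    def grad(x):
        # f(x) = sum_i x_i^2 / (1 + x_i^2): smooth, bounded, non-convex.
        return 2.0 * x / (1.0 + x**2) ** 2

    def medoid(vectors):
        # The vector minimizing the sum of Euclidean distances to all others.
        d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
        return vectors[np.argmin(d.sum(axis=1))]

    def r_sgd_mini_check(dim=10, T=20_000, K=30, M=5, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        x = np.full(dim, 3.0)
        sq_norms = []
        for _ in range(T):
            # K noisy gradient samples with symmetric heavy-tailed (Cauchy) noise.
            samples = grad(x) + rng.standard_cauchy(size=(K, dim))
            # Split into M chunks, average within each chunk, take the medoid.
            chunk_grads = samples.reshape(M, K // M, dim).mean(axis=1)
            x = x - alpha * medoid(chunk_grads)
            sq_norms.append(np.linalg.norm(grad(x)) ** 2)
        # Running time-average of the squared gradient norm.
        return np.cumsum(sq_norms) / np.arange(1, T + 1)

    curve = r_sgd_mini_check()
    print(curve[::2000])  # should trend down toward a small noise-dependent floor

Plotting the returned running average against 1/t on log-log axes would make the predicted O(T^{-1}) approach to the neighborhood visible, or its absence.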

Figures

Figures reproduced from arXiv: 2605.07634 by Dusan Jakovetic, Manojlo Vukovic.

Figure 1. Evolution of the average iterates across Monte Carlo runs.
Figure 2. Test accuracy over iterations on CIFAR-10 (ResNet-18).
read the original abstract

We consider a first order stochastic optimization framework where, at each iteration, $K$ independent identically distributed (i.i.d.) data point samples are drawn, based on which stochastic gradients can be queried. We allow gradient noise to be heavy-tailed, with possibly infinite variances. For the considered heavy-tailed setting, many algorithmic variants have recently been proposed based on gradient clipping or other nonlinear operators (e.g., normalization) applied over noisy gradients. In this paper, we take an alternative approach and propose a novel stochastic first order method dubbed Robust Stochastic Gradient Descent with medoid mini-batch gradient sampling, R-SGD-Mini for short. The core idea of R-SGD-Mini is to split the $K$-sized data batch into $M$ distinct data chunks, form for each chunk the stochastic gradient, and update the solution estimate with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks. Under a general class of symmetric heavy-tailed gradient noises and a standard non-convex setting, we establish explicit bounds on the expected time-averaged squared gradient norm. More precisely, we show that the latter quantity converges at rate $\mathcal{O}(T^{-1})$ to a small neighborhood of zero; we explicitly characterize this neighborhood in terms of noise and algorithm's parameters. Moreover, if the time horizon is known in advance, we establish the rate of $\mathcal{O}(T^{-\frac{1}{2}}).$ Furthermore, when clipping is incorporated, we obtain convergence guaranties in the high-probability sense and recover the same rate. Experimental results indicate that R-SGD-Mini and its clipped variant consistently perform favorably compared to SGD, clipped SGD and Median-of-Means based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces R-SGD-Mini, a stochastic first-order method for non-convex optimization under symmetric heavy-tailed gradient noise (possibly with infinite variance). The algorithm splits a batch of K samples into M chunks, computes chunk gradients, selects the medoid gradient, and performs the update with it. The central claims are explicit convergence bounds: the expected time-averaged squared gradient norm converges at rate O(T^{-1}) to a neighborhood of zero whose size is characterized in terms of noise parameters and M; the rate improves to O(T^{-1/2}) when the horizon T is known in advance; and high-probability guarantees at the same rates are obtained when clipping is added. Experiments indicate favorable performance relative to SGD, clipped SGD, and median-of-means baselines.

Significance. If the analysis correctly controls the deviation of the medoid estimator, the work supplies a clipping-free robust alternative with explicit neighborhood sizes and rates that are competitive with existing heavy-tailed methods. The combination of expectation and high-probability results, together with the explicit dependence on algorithm parameters, would be useful for both theory and practice in robust non-convex stochastic optimization.

major comments (2)
  1. [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the stated O(T^{-1}) rate and neighborhood size are derived under the assumption that the medoid gradient satisfies E[g_medoid] = ∇f(x) or that any bias is absorbed without enlarging the neighborhood. For finite M and symmetric heavy-tailed noise with infinite variance, the medoid (the chunk gradient minimizing the sum of Euclidean distances to the other M-1 chunks) is not necessarily unbiased, unlike the coordinate-wise median. The proof must separately bound ||E[g_medoid(x)] - ∇f(x)|| or show that the selection operator preserves the required martingale property; otherwise the neighborhood characterization and rate do not follow.
  2. [§3.2 and clipping theorem] §3.2 (Noise model) and the high-probability result with clipping: the analysis invokes symmetry to control the deviation of the medoid, but the coupling induced by the argmin over all pairwise distances may invalidate standard concentration arguments used for independent medians. A concrete counter-example or moment bound showing that the medoid deviation remains O(1/sqrt(M)) or better under the paper's noise class is required to support the claimed high-probability rate.
minor comments (3)
  1. [Definition 2] Notation for the medoid selection (Definition 2) should explicitly state whether ties are broken randomly or deterministically, as this affects the measurability of the estimator in the martingale analysis.
  2. [Experiments] In the experimental section, the values of M and K used for each dataset should be reported together with the observed neighborhood size to allow direct comparison with the theoretical bound.
  3. [§2] A few minor typos appear in the statement of the non-convex smoothness assumption (Lipschitz constant notation is inconsistent between text and equations).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and supporting arguments.

read point-by-point responses
  1. Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the stated O(T^{-1}) rate and neighborhood size are derived under the assumption that the medoid gradient satisfies E[g_medoid] = ∇f(x) or that any bias is absorbed without enlarging the neighborhood. For finite M and symmetric heavy-tailed noise with infinite variance, the medoid (the chunk gradient minimizing the sum of Euclidean distances to the other M-1 chunks) is not necessarily unbiased, unlike the coordinate-wise median. The proof must separately bound ||E[g_medoid(x)] - ∇f(x)|| or show that the selection operator preserves the required martingale property; otherwise the neighborhood characterization and rate do not follow.

    Authors: We agree that explicit justification is needed. The M chunk gradients are i.i.d., and the medoid selection rule is equivariant under the point reflection g ↦ 2∇f(x) − g, which preserves all pairwise Euclidean distances. Under the paper's symmetric noise model the chunk gradients are distributed symmetrically about ∇f(x), so the selected g_medoid inherits this symmetry: g_medoid(x) and 2∇f(x) − g_medoid(x) have the same distribution. Provided the stochastic gradient has a finite first moment (a standard assumption even when the variance is infinite), this yields E[g_medoid(x)] = ∇f(x) and preserves the martingale property required by the convergence analysis. We will add a short supporting lemma formalizing this symmetry argument in the revised manuscript (an editorial sketch of how such a lemma could read follows these responses). revision: yes

  2. Referee: [§3.2 and clipping theorem] §3.2 (Noise model) and the high-probability result with clipping: the analysis invokes symmetry to control the deviation of the medoid, but the coupling induced by the argmin over all pairwise distances may invalidate standard concentration arguments used for independent medians. A concrete counter-example or moment bound showing that the medoid deviation remains O(1/sqrt(M)) or better under the paper's noise class is required to support the claimed high-probability rate.

    Authors: We acknowledge that the selection-induced dependence requires a dedicated argument rather than direct application of independent-median bounds. In the revised manuscript we will add a lemma that derives the requested moment bound on ||g_medoid(x) - ∇f(x)||. Using the symmetry of the noise and the fact that each chunk is equally likely to be selected under the exchangeable joint distribution, we bound the deviation by O(1/sqrt(M)) in expectation (and obtain matching high-probability tails when combined with clipping). This explicitly justifies the high-probability rates claimed for the clipped variant. revision: yes
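
An editorial sketch of how the unbiasedness lemma promised in the first response could read, assuming the noise is symmetric about the true gradient, the first moment is finite, and ties in the medoid selection are broken by a rule unaffected by point reflection; this is a reconstruction, not the authors' lemma:

    % Editorial sketch; not taken from the paper.
    \textbf{Lemma (sketch).} Fix $x$ and let $g_1, \dots, g_M$ be i.i.d. with
    $g_i - \nabla f(x)$ symmetric about the origin and $\mathbb{E}\lVert g_i \rVert < \infty$.
    Let $g_{\mathrm{med}} = \operatorname*{arg\,min}_{g_i}\, \sum_{j=1}^{M} \lVert g_i - g_j \rVert$.
    The reflection $R(g) = 2\nabla f(x) - g$ is an isometry, so the medoid of
    $\bigl(R(g_1), \dots, R(g_M)\bigr)$ is $R(g_{\mathrm{med}})$. Symmetry of the noise gives
    $\bigl(R(g_1), \dots, R(g_M)\bigr) \overset{d}{=} (g_1, \dots, g_M)$, hence
    $g_{\mathrm{med}} \overset{d}{=} 2\nabla f(x) - g_{\mathrm{med}}$ and
    $\mathbb{E}\bigl[g_{\mathrm{med}}\bigr] = \nabla f(x)$.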

Circularity Check

0 steps flagged

No significant circularity; the rates follow from a standard analysis of the medoid estimator under the stated assumptions.

full rationale

The derivation establishes explicit convergence rates for the time-averaged squared gradient norm by applying standard non-convex smoothness and bounded-gradient assumptions to the R-SGD-Mini update, which selects the medoid chunk gradient. The neighborhood size is characterized directly in terms of the noise distribution parameters and algorithm constants (M, K, step-size schedule), without any reduction of the target quantity to a fitted parameter, self-defined quantity, or load-bearing self-citation. The O(T^{-1}) and O(T^{-1/2}) rates are obtained via the usual telescoping argument on the descent lemma plus a bound on the deviation of the medoid estimator; these steps are independent of the final result and do not rely on renaming or smuggling an ansatz. The analysis is therefore self-contained against external benchmarks.
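
For reference, the generic shape of the telescoping step mentioned above, under standard L-smoothness and an unbiased gradient estimator (an editorial illustration, not the paper's proof; under heavy-tailed noise the final second-moment term can be infinite and must instead be replaced by a bound on the medoid's deviation):

    % Generic descent-lemma telescoping; editorial illustration only.
    f(x_{t+1}) \le f(x_t) - \alpha \langle \nabla f(x_t), g_t \rangle
        + \frac{L \alpha^2}{2} \lVert g_t \rVert^2,
        \qquad x_{t+1} = x_t - \alpha g_t .
    % Take expectations with E[g_t | x_t] = \nabla f(x_t), sum over t = 0, ..., T-1,
    % and telescope f(x_t) - f(x_{t+1}):
    \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\lVert \nabla f(x_t) \rVert^2
        \le \frac{f(x_0) - f^*}{\alpha T}
        + \frac{L \alpha}{2} \cdot \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\lVert g_t \rVert^2 .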

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions for stochastic non-convex optimization plus the novel medoid sampling mechanism; no free parameters are fitted in the stated rates, and no new entities are postulated.

axioms (2)
  • domain assumption Gradient noise is symmetric and heavy-tailed, possibly with infinite variance
    Invoked to establish robustness and explicit neighborhood bounds for the convergence rates
  • domain assumption Objective is non-convex but satisfies standard smoothness and bounded gradient assumptions
    Standard in non-convex stochastic optimization analysis to bound the time-averaged squared gradient norm
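
In their standard forms (editorial phrasing; the paper's exact conditions may differ in constants or moment orders), these two assumptions typically read:

    % Standard statements of the two ledger axioms; editorial phrasing.
    \text{(Symmetric heavy-tailed noise)}\quad
        g(x) - \nabla f(x) \overset{d}{=} -\bigl(g(x) - \nabla f(x)\bigr),
        \quad \mathbb{E}\lVert g(x) \rVert < \infty,\ \text{variance possibly infinite};
    \text{(Non-convex smoothness)}\quad
        \lVert \nabla f(x) - \nabla f(y) \rVert \le L \lVert x - y \rVert
        \ \ \text{for all } x, y, \qquad \inf_x f(x) = f^* > -\infty .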

pith-pipeline@v0.9.0 · 5629 in / 1450 out tokens · 82371 ms · 2026-05-11T02:32:28.082351+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1] Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Optimal high-probability convergence of nonlinear SGD under heavy-tailed noise via symmetrization, 2025. URL: https://arxiv.org/abs/2507.09093
  2. [2] Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. High-probability convergence bounds for online nonlinear stochastic gradient descent under heavy-tailed noise. In The 28th International Conference on Artificial Intelligence and Statistics, volume 258, page 9, 2025.
  3. [3] Melih Barsbey, Milad Sefidgaran, Murat A. Erdogdu, Gael Richard, and Umut Simsekli. Heavy tails in SGD and compressibility of overparametrized neural networks. Advances in Neural Information Processing Systems, 34:29364–29378, 2021.
  4. [4] Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 4780–4788. PMLR, 2024.
  5. [5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
  6. [6] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
  7. [7] Xiangyi Chen, Steven Z. Wu, and Mingyi Hong. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems, 33:13773–13782, 2020.
  8. [8] Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 34:4883–4895, 2021.
  9. [9] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  10. [10] Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.
  11. [11] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. In International Conference on Machine Learning, pages 3964–3975. PMLR, 2021.
  12. [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  13. [13] Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, Soummya Kar, Nemanja Milosevic, and Dusan Stamenkovic. Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise. SIAM Journal on Optimization, 33(2):394–423.
  14. [14] doi: 10.1137/21M145896X. URL: https://doi.org/10.1137/21M145896X (DOI record for reference [13]).
  15. [15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  16. [16] Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2266–2290. PMLR, 2023.
  17. [17] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
  18. [18] Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. 2019.
  19. [19] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. 2015.
  20. [20] Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The fundamentals of heavy tails. 2013.
  21. [21] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  22. [22] Ta Duy Nguyen, Thien H. Nguyen, Alina Ene, and Huy Nguyen. Improved convergence in high probability of clipped gradient methods with heavy tailed noise. Advances in Neural Information Processing Systems, 36:24191–24222, 2023.
  23. [23] Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Le Nguyen. High probability convergence of clipped-SGD under heavy-tailed noise. arXiv preprint arXiv:2302.05437, 2023.
  24. [24] Stefano Peluchetti, Stefano Favaro, and Sandra Fortini. Stable behaviour of infinitely wide deep neural networks. In International Conference on Artificial Intelligence and Statistics, pages 1137–1146. PMLR, 2020.
  25. [25] Nikita Puchkin, Eduard Gorbunov, Nickolay Kutuzov, and Alexander Gasnikov. Breaking the heavy-tailed noise barrier in stochastic optimization problems. In International Conference on Artificial Intelligence and Statistics, pages 856–864. PMLR, 2024.
  26. [26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  27. [27] Erich Schubert and Peter J. Rousseeuw. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications, pages 171–187. Springer, 2019.
  28. [28] Erich Schubert and Peter J. Rousseeuw. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems, 101:101804, 2021.
  29. [30] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
  30. [31] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.