Recognition: 2 theorem links
Lean Theorem · Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling
Pith reviewed 2026-05-11 02:32 UTC · model grok-4.3
The pith
Medoid selection on mini-batch gradients yields O(T^{-1}) convergence for non-convex stochastic optimization under heavy-tailed noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a general class of symmetric heavy-tailed gradient noises and standard non-convex assumptions, R-SGD-Mini establishes explicit bounds showing that the expected time-averaged squared gradient norm converges at rate O(T^{-1}) to a small neighborhood of zero characterized by the noise and parameters; the rate improves to O(T^{-1/2}) when the horizon is known in advance, and clipping yields high-probability convergence at the same rates.
What carries the argument
Medoid mini-batch gradient sampling: the K-sample batch is partitioned into M chunks, a gradient is formed for each chunk, and the update uses the medoid gradient among the M vectors.
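The update described above can be sketched in a few lines (a minimal NumPy sketch, not the paper's code; `grad_fn` is a hypothetical per-chunk gradient oracle introduced here for illustration):

```python
import numpy as np

def medoid_gradient(chunk_grads):
    """Return the chunk gradient that minimizes the sum of Euclidean
    distances to the other chunk gradients (the medoid)."""
    G = np.asarray(chunk_grads, dtype=float)              # shape (M, d)
    # Pairwise Euclidean distances between the M chunk gradients.
    dists = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2)
    return G[np.argmin(dists.sum(axis=1))]

def rsgd_mini_step(x, grad_fn, samples, M, step_size):
    """One R-SGD-Mini update: split the K samples into M chunks, form a
    stochastic gradient per chunk, and step along the medoid gradient."""
    chunks = np.array_split(samples, M)
    chunk_grads = [grad_fn(x, chunk) for chunk in chunks]
    return x - step_size * medoid_gradient(chunk_grads)
```

Selecting the medoid keeps the update equal to one of the actually computed chunk gradients, unlike averaging, which is what gives the estimator its robustness to a minority of heavy-tailed outlier chunks.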
If this is right
- The method converges without requiring finite variance or higher moments on the stochastic gradients.
- The neighborhood of convergence is explicitly characterized in terms of noise parameters and algorithm constants.
- High-probability guarantees are recovered by incorporating gradient clipping.
- Empirical performance is competitive with or better than SGD, clipped SGD, and median-of-means methods on the tested problems.
Where Pith is reading between the lines
- This selection strategy could be tested on other heavy-tailed settings such as distributed optimization with communication noise.
- Combining the medoid step with adaptive step sizes or momentum might further improve practical rates beyond the proved bounds.
- The approach suggests a broader class of order-statistic based gradient estimators for robust stochastic optimization.
Load-bearing premise
The stochastic gradient noise is symmetric heavy-tailed and the medoid selection step produces no additional bias outside the analyzed terms.
What would settle it
Running the algorithm on a non-convex test problem with Cauchy noise: if the time-averaged squared gradient norm fails to decrease toward the predicted neighborhood at rate O(T^{-1}), the core claim is refuted.
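Such an experiment can be set up compactly. The following is an illustrative harness under assumed toy choices (1-D objective f(x) = x² + 3 sin²(x), per-sample Cauchy noise, K = 32, M = 8; none of these values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of the smooth non-convex toy objective f(x) = x^2 + 3*sin(x)^2.
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def medoid(vals):
    # Medoid of scalar chunk gradients: minimizes summed absolute distances.
    d = np.abs(vals[:, None] - vals[None, :]).sum(axis=1)
    return vals[np.argmin(d)]

def run(T=2000, K=32, M=8, step=0.05):
    x, sq_norms = 3.0, []
    for _ in range(T):
        # K i.i.d. Cauchy noise draws split into M chunks of K // M each;
        # the mean of i.i.d. Cauchy variables is again Cauchy (heavy-tailed).
        chunk_noise = rng.standard_cauchy((M, K // M)).mean(axis=1)
        x -= step * medoid(grad_f(x) + chunk_noise)
        sq_norms.append(grad_f(x) ** 2)
    return float(np.mean(sq_norms))  # time-averaged squared gradient norm
```

If the claim holds, `run()` should return a small value that stabilizes near the predicted neighborhood rather than diverging under the infinite-variance noise.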
Original abstract
We consider a first order stochastic optimization framework where, at each iteration, $K$ independent identically distributed (i.i.d.) data point samples are drawn, based on which stochastic gradients can be queried. We allow gradient noise to be heavy-tailed, with possibly infinite variances. For the considered heavy-tailed setting, many algorithmic variants have recently been proposed based on gradient clipping or other nonlinear operators (e.g., normalization) applied over noisy gradients. In this paper, we take an alternative approach and propose a novel stochastic first order method dubbed Robust Stochastic Gradient Descent with medoid mini-batch gradient sampling, R-SGD-Mini for short. The core idea of R-SGD-Mini is to split the $K$-sized data batch into $M$ distinct data chunks, form for each chunk the stochastic gradient, and update the solution estimate with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks. Under a general class of symmetric heavy-tailed gradient noises and a standard non-convex setting, we establish explicit bounds on the expected time-averaged squared gradient norm. More precisely, we show that the latter quantity converges at rate $\mathcal{O}(T^{-1})$ to a small neighborhood of zero; we explicitly characterize this neighborhood in terms of noise and algorithm's parameters. Moreover, if the time horizon is known in advance, we establish the rate of $\mathcal{O}(T^{-\frac{1}{2}}).$ Furthermore, when clipping is incorporated, we obtain convergence guarantees in the high-probability sense and recover the same rate. Experimental results indicate that R-SGD-Mini and its clipped variant consistently perform favorably compared to SGD, clipped SGD and Median-of-Means based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces R-SGD-Mini, a stochastic first-order method for non-convex optimization under symmetric heavy-tailed gradient noise (possibly with infinite variance). The algorithm splits a batch of K samples into M chunks, computes chunk gradients, selects the medoid gradient, and performs the update with it. The central claims are explicit convergence bounds: the expected time-averaged squared gradient norm converges at rate O(T^{-1}) to a neighborhood of zero whose size is characterized in terms of noise parameters and M; the rate improves to O(T^{-1/2}) when the horizon T is known in advance; and high-probability guarantees at the same rates are obtained when clipping is added. Experiments indicate favorable performance relative to SGD, clipped SGD, and median-of-means baselines.
Significance. If the analysis correctly controls the deviation of the medoid estimator, the work supplies a clipping-free robust alternative with explicit neighborhood sizes and rates that are competitive with existing heavy-tailed methods. The combination of expectation and high-probability results, together with the explicit dependence on algorithm parameters, would be useful for both theory and practice in robust non-convex stochastic optimization.
major comments (2)
- [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the stated O(T^{-1}) rate and neighborhood size are derived under the assumption that the medoid gradient satisfies E[g_medoid] = ∇f(x) or that any bias is absorbed without enlarging the neighborhood. For finite M and symmetric heavy-tailed noise with infinite variance, the medoid (the chunk gradient minimizing the sum of Euclidean distances to the other M-1 chunks) is not necessarily unbiased, unlike the coordinate-wise median. The proof must separately bound ||E[g_medoid(x)] - ∇f(x)|| or show that the selection operator preserves the required martingale property; otherwise the neighborhood characterization and rate do not follow.
- [§3.2 and clipping theorem] §3.2 (Noise model) and the high-probability result with clipping: the analysis invokes symmetry to control the deviation of the medoid, but the coupling induced by the argmin over all pairwise distances may invalidate standard concentration arguments used for independent medians. A concrete counter-example or moment bound showing that the medoid deviation remains O(1/sqrt(M)) or better under the paper's noise class is required to support the claimed high-probability rate.
minor comments (3)
- [Definition 2] Notation for the medoid selection (Definition 2) should explicitly state whether ties are broken randomly or deterministically, as this affects the measurability of the estimator in the martingale analysis.
- [Experiments] In the experimental section, the values of M and K used for each dataset should be reported together with the observed neighborhood size to allow direct comparison with the theoretical bound.
- [§2] A few minor typos appear in the statement of the non-convex smoothness assumption (Lipschitz constant notation is inconsistent between text and equations).
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and supporting arguments.
Point-by-point responses
Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the stated O(T^{-1}) rate and neighborhood size are derived under the assumption that the medoid gradient satisfies E[g_medoid] = ∇f(x) or that any bias is absorbed without enlarging the neighborhood. For finite M and symmetric heavy-tailed noise with infinite variance, the medoid (the chunk gradient minimizing the sum of Euclidean distances to the other M-1 chunks) is not necessarily unbiased, unlike the coordinate-wise median. The proof must separately bound ||E[g_medoid(x)] - ∇f(x)|| or show that the selection operator preserves the required martingale property; otherwise the neighborhood characterization and rate do not follow.
Authors: We agree that explicit justification is needed. The M chunk gradients are i.i.d., and the medoid selection rule depends only on the pairwise Euclidean distances among them. Reflecting all noise vectors (ν_i → -ν_i) preserves these pairwise distances and, under the paper's symmetric noise model, also preserves the joint distribution of the chunk gradients; consequently the distribution of g_medoid is symmetric about ∇f(x). Provided the stochastic gradient has a finite first moment (a standard assumption even when the variance is infinite), this yields E[g_medoid(x)] = ∇f(x) and preserves the martingale property required by the convergence analysis. We will add a short supporting lemma formalizing this symmetry argument, together with a measurable, reflection-invariant tie-breaking rule, in the revised manuscript. revision: yes
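A quick Monte Carlo sanity check of this unbiasedness claim (an illustrative sketch, not the paper's lemma; the Student-t noise, dimension, M, and trial count are assumed values chosen here):

```python
import numpy as np

rng = np.random.default_rng(1)

def medoid(G):
    # Row of G minimizing the summed Euclidean distances to the other rows.
    d = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2).sum(axis=1)
    return G[np.argmin(d)]

true_grad = np.array([1.0, -2.0])
M, trials = 7, 20000
acc = np.zeros_like(true_grad)
for _ in range(trials):
    # Symmetric heavy-tailed noise: Student-t with df = 1.5 has a finite
    # first moment but infinite variance.
    acc += medoid(true_grad + rng.standard_t(1.5, size=(M, 2)))
estimate = acc / trials
# Reflecting the noise preserves all pairwise distances, so the medoid's
# distribution is symmetric about the true gradient; the Monte Carlo mean
# should therefore land close to true_grad.
```

The check probes only unbiasedness of the selected gradient, not the rate constants, so it complements rather than replaces the formal lemma.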
Referee: [§3.2 and clipping theorem] §3.2 (Noise model) and the high-probability result with clipping: the analysis invokes symmetry to control the deviation of the medoid, but the coupling induced by the argmin over all pairwise distances may invalidate standard concentration arguments used for independent medians. A concrete counter-example or moment bound showing that the medoid deviation remains O(1/sqrt(M)) or better under the paper's noise class is required to support the claimed high-probability rate.
Authors: We acknowledge that the selection-induced dependence requires a dedicated argument rather than direct application of independent-median bounds. In the revised manuscript we will add a lemma that derives the requested moment bound on ||g_medoid(x) - ∇f(x)||. Using the symmetry of the noise and the fact that each chunk is equally likely to be selected under the exchangeable joint distribution, we bound the deviation by O(1/sqrt(M)) in expectation (and obtain matching high-probability tails when combined with clipping). This explicitly justifies the high-probability rates claimed for the clipped variant. revision: yes
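The claimed deviation behavior can at least be probed numerically. The sketch below, under an assumed Student-t noise model (df = 1.5, infinite variance), checks only that the medoid's deviation shrinks as M grows for fixed per-chunk noise; it does not resolve the selection-coupling issue the referee raises:

```python
import numpy as np

rng = np.random.default_rng(2)

def medoid(G):
    # Row of G minimizing the summed Euclidean distances to the other rows.
    d = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2).sum(axis=1)
    return G[np.argmin(d)]

def mean_deviation(M, trials=2000, dim=2):
    # Average Euclidean deviation of the medoid of M symmetric heavy-tailed
    # noise vectors (Student-t, df = 1.5) from the true mean, which is 0.
    total = 0.0
    for _ in range(trials):
        total += np.linalg.norm(medoid(rng.standard_t(1.5, size=(M, dim))))
    return total / trials
```

Observing `mean_deviation(M)` decrease across M = 3, 9, 27 is consistent with, though far from a proof of, an O(1/sqrt(M))-type bound.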
Circularity Check
No significant circularity: the rates follow from a standard analysis of the medoid estimator under the stated assumptions.
Full rationale
The derivation establishes explicit convergence rates for the time-averaged squared gradient norm by applying standard non-convex smoothness and bounded-gradient assumptions to the R-SGD-Mini update, which selects the medoid chunk gradient. The neighborhood size is characterized directly in terms of the noise distribution parameters and algorithm constants (M, K, step-size schedule), without any reduction of the target quantity to a fitted parameter, self-defined quantity, or load-bearing self-citation. The O(T^{-1}) and O(T^{-1/2}) rates are obtained via the usual telescoping argument on the descent lemma plus a bound on the deviation of the medoid estimator; these steps are independent of the final result and do not rely on renaming or smuggling an ansatz. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: gradient noise is symmetric and heavy-tailed, possibly with infinite variance.
- Domain assumption: the objective is non-convex but satisfies standard smoothness and bounded gradient assumptions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel): unclear. Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "split the K-sized data batch into M distinct data chunks... update... with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction): unclear. Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "E[||ν_j⋆||²] ≤ O(R^{-2(p-1)/p}) ... E[ν_j⋆]=0"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Optimal high-probability convergence of nonlinear SGD under heavy-tailed noise via symmetrization, 2025. URL https://arxiv.org/abs/2507.09093
- [2] Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. High-probability convergence bounds for online nonlinear stochastic gradient descent under heavy-tailed noise. In The 28th International Conference on Artificial Intelligence and Statistics, volume 258, page 9, 2025.
- [3] Melih Barsbey, Milad Sefidgaran, Murat A. Erdogdu, Gael Richard, and Umut Simsekli. Heavy tails in SGD and compressibility of overparametrized neural networks. Advances in Neural Information Processing Systems, 34:29364–29378, 2021.
- [4] Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 4780–4788. PMLR, 2024.
- [5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
- [6] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
- [7] Xiangyi Chen, Steven Z. Wu, and Mingyi Hong. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems, 33:13773–13782, 2020.
- [8] Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 34:4883–4895, 2021.
- [9] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [10] Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.
- [11] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. In International Conference on Machine Learning, pages 3964–3975. PMLR, 2021.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [13] Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, Soummya Kar, Nemanja Milosevic, and Dusan Stamenkovic. Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise. SIAM Journal on Optimization, 33(2):394–423. doi: 10.1137/21M145896X. URL https://doi.org/10.1137/21M145896X
- [15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [16] Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2266–2290. PMLR, 2023.
- [17] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
- [18] Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. 2019.
- [19] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. 2015.
- [20] Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The fundamentals of heavy tails. 2013.
- [21] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [22] Ta Duy Nguyen, Thien H. Nguyen, Alina Ene, and Huy Nguyen. Improved convergence in high probability of clipped gradient methods with heavy tailed noise. Advances in Neural Information Processing Systems, 36:24191–24222, 2023.
- [23] Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Le Nguyen. High probability convergence of clipped-SGD under heavy-tailed noise. arXiv preprint arXiv:2302.05437, 2023.
- [24] Stefano Peluchetti, Stefano Favaro, and Sandra Fortini. Stable behaviour of infinitely wide deep neural networks. In International Conference on Artificial Intelligence and Statistics, pages 1137–1146. PMLR, 2020.
- [25] Nikita Puchkin, Eduard Gorbunov, Nickolay Kutuzov, and Alexander Gasnikov. Breaking the heavy-tailed noise barrier in stochastic optimization problems. In International Conference on Artificial Intelligence and Statistics, pages 856–864. PMLR, 2024.
- [26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [27] Erich Schubert and Peter J. Rousseeuw. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications, pages 171–187. Springer, 2019.
- [28] Erich Schubert and Peter J. Rousseeuw. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems, 101:101804, 2021.
- [30] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
- [31] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.