Recognition: 2 theorem links
Lean Theorem · Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling
Pith reviewed 2026-05-11 02:32 UTC · model grok-4.3
The pith
Medoid selection on mini-batch gradients yields O(T^{-1}) convergence for non-convex stochastic optimization under heavy-tailed noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a general class of symmetric heavy-tailed gradient noises and standard non-convex assumptions, R-SGD-Mini establishes explicit bounds showing that the expected time-averaged squared gradient norm converges at rate O(T^{-1}) to a small neighborhood of zero characterized by the noise and parameters; the rate improves to O(T^{-1/2}) when the horizon is known in advance, and clipping yields high-probability convergence at the same rates.
What carries the argument
Medoid mini-batch gradient sampling: the K-sample batch is partitioned into M chunks, a gradient is formed for each chunk, and the update uses the medoid gradient among the M vectors.
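The update described above can be sketched in a few lines (a minimal NumPy sketch, not the paper's code; `grad_fn` is a hypothetical per-chunk gradient oracle introduced here for illustration):

```python
import numpy as np

def medoid_gradient(chunk_grads):
    """Return the chunk gradient that minimizes the sum of Euclidean
    distances to the other chunk gradients (the medoid)."""
    G = np.asarray(chunk_grads, dtype=float)              # shape (M, d)
    # Pairwise Euclidean distances between the M chunk gradients.
    dists = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2)
    return G[np.argmin(dists.sum(axis=1))]

def rsgd_mini_step(x, grad_fn, samples, M, step_size):
    """One R-SGD-Mini update: split the K samples into M chunks, form a
    stochastic gradient per chunk, and step along the medoid gradient."""
    chunks = np.array_split(samples, M)
    chunk_grads = [grad_fn(x, chunk) for chunk in chunks]
    return x - step_size * medoid_gradient(chunk_grads)
```

Selecting the medoid keeps the update equal to one of the actually computed chunk gradients, unlike averaging, which is what gives the estimator its robustness to a minority of heavy-tailed outlier chunks.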
If this is right
- The method converges without requiring finite variance or higher moments on the stochastic gradients.
- The neighborhood of convergence is explicitly characterized in terms of noise parameters and algorithm constants.
- High-probability guarantees are recovered by incorporating gradient clipping.
- Empirical performance is competitive with or better than SGD, clipped SGD, and median-of-means methods on the tested problems.
Where Pith is reading between the lines
- This selection strategy could be tested on other heavy-tailed settings such as distributed optimization with communication noise.
- Combining the medoid step with adaptive step sizes or momentum might further improve practical rates beyond the proved bounds.
- The approach suggests a broader class of order-statistic based gradient estimators for robust stochastic optimization.
Load-bearing premise
The stochastic gradient noise is symmetric heavy-tailed and the medoid selection step produces no additional bias outside the analyzed terms.
What would settle it
Running the algorithm on a non-convex test problem with Cauchy noise: if the time-averaged squared gradient norm fails to decrease toward the predicted neighborhood at rate O(T^{-1}), the core claim is refuted.
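Such an experiment can be set up compactly. The following is an illustrative harness under assumed toy choices (1-D objective f(x) = x² + 3 sin²(x), per-sample Cauchy noise, K = 32, M = 8; none of these values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of the smooth non-convex toy objective f(x) = x^2 + 3*sin(x)^2.
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def medoid(vals):
    # Medoid of scalar chunk gradients: minimizes summed absolute distances.
    d = np.abs(vals[:, None] - vals[None, :]).sum(axis=1)
    return vals[np.argmin(d)]

def run(T=2000, K=32, M=8, step=0.05):
    x, sq_norms = 3.0, []
    for _ in range(T):
        # K i.i.d. Cauchy noise draws split into M chunks of K // M each;
        # the mean of i.i.d. Cauchy variables is again Cauchy (heavy-tailed).
        chunk_noise = rng.standard_cauchy((M, K // M)).mean(axis=1)
        x -= step * medoid(grad_f(x) + chunk_noise)
        sq_norms.append(grad_f(x) ** 2)
    return float(np.mean(sq_norms))  # time-averaged squared gradient norm
```

If the claim holds, `run()` should return a small value that stabilizes near the predicted neighborhood rather than diverging under the infinite-variance noise.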
Original abstract
We consider a first order stochastic optimization framework where, at each iteration, $K$ independent identically distributed (i.i.d.) data point samples are drawn, based on which stochastic gradients can be queried. We allow gradient noise to be heavy-tailed, with possibly infinite variances. For the considered heavy-tailed setting, many algorithmic variants have recently been proposed based on gradient clipping or other nonlinear operators (e.g., normalization) applied over noisy gradients. In this paper, we take an alternative approach and propose a novel stochastic first order method dubbed Robust Stochastic Gradient Descent with medoid mini-batch gradient sampling, R-SGD-Mini for short. The core idea of R-SGD-Mini is to split the $K$-sized data batch into $M$ distinct data chunks, form for each chunk the stochastic gradient, and update the solution estimate with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks. Under a general class of symmetric heavy-tailed gradient noises and a standard non-convex setting, we establish explicit bounds on the expected time-averaged squared gradient norm. More precisely, we show that the latter quantity converges at rate $\mathcal{O}(T^{-1})$ to a small neighborhood of zero; we explicitly characterize this neighborhood in terms of noise and algorithm's parameters. Moreover, if the time horizon is known in advance, we establish the rate of $\mathcal{O}(T^{-\frac{1}{2}}).$ Furthermore, when clipping is incorporated, we obtain convergence guarantees in the high-probability sense and recover the same rate. Experimental results indicate that R-SGD-Mini and its clipped variant consistently perform favorably compared to SGD, clipped SGD and Median-of-Means based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces R-SGD-Mini, a stochastic first-order method for non-convex optimization under symmetric heavy-tailed gradient noise (possibly with infinite variance). The algorithm splits a batch of K samples into M chunks, computes chunk gradients, selects the medoid gradient, and performs the update with it. The central claims are explicit convergence bounds: the expected time-averaged squared gradient norm converges at rate O(T^{-1}) to a neighborhood of zero whose size is characterized in terms of noise parameters and M; the rate improves to O(T^{-1/2}) when the horizon T is known in advance; and high-probability guarantees at the same rates are obtained when clipping is added. Experiments indicate favorable performance relative to SGD, clipped SGD, and median-of-means baselines.
Significance. If the analysis correctly controls the deviation of the medoid estimator, the work supplies a clipping-free robust alternative with explicit neighborhood sizes and rates that are competitive with existing heavy-tailed methods. The combination of expectation and high-probability results, together with the explicit dependence on algorithm parameters, would be useful for both theory and practice in robust non-convex stochastic optimization.
major comments (2)
- [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the stated O(T^{-1}) rate and neighborhood size are derived under the assumption that the medoid gradient satisfies E[g_medoid] = ∇f(x) or that any bias is absorbed without enlarging the neighborhood. For finite M and symmetric heavy-tailed noise with infinite variance, the medoid (the chunk gradient minimizing the sum of Euclidean distances to the other M-1 chunks) is not necessarily unbiased, unlike the coordinate-wise median. The proof must separately bound ||E[g_medoid(x)] - ∇f(x)|| or show that the selection operator preserves the required martingale property; otherwise the neighborhood characterization and rate do not follow.
- [§3.2 and clipping theorem] §3.2 (Noise model) and the high-probability result with clipping: the analysis invokes symmetry to control the deviation of the medoid, but the coupling induced by the argmin over all pairwise distances may invalidate standard concentration arguments used for independent medians. A concrete counter-example or moment bound showing that the medoid deviation remains O(1/sqrt(M)) or better under the paper's noise class is required to support the claimed high-probability rate.
minor comments (3)
- [Definition 2] Notation for the medoid selection (Definition 2) should explicitly state whether ties are broken randomly or deterministically, as this affects the measurability of the estimator in the martingale analysis.
- [Experiments] In the experimental section, the values of M and K used for each dataset should be reported together with the observed neighborhood size to allow direct comparison with the theoretical bound.
- [§2] A few minor typos appear in the statement of the non-convex smoothness assumption (Lipschitz constant notation is inconsistent between text and equations).
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and supporting arguments.
Point-by-point responses
Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the stated O(T^{-1}) rate and neighborhood size are derived under the assumption that the medoid gradient satisfies E[g_medoid] = ∇f(x) or that any bias is absorbed without enlarging the neighborhood. For finite M and symmetric heavy-tailed noise with infinite variance, the medoid (the chunk gradient minimizing the sum of Euclidean distances to the other M-1 chunks) is not necessarily unbiased, unlike the coordinate-wise median. The proof must separately bound ||E[g_medoid(x)] - ∇f(x)|| or show that the selection operator preserves the required martingale property; otherwise the neighborhood characterization and rate do not follow.
Authors: We agree that explicit justification is needed. The M chunk gradients are i.i.d., and the medoid selection rule depends only on the pairwise Euclidean distances among them. Reflecting all noise vectors (ν_i → -ν_i) preserves these pairwise distances and, under the paper's symmetric noise model, also preserves the joint distribution of the chunk gradients; consequently the distribution of g_medoid is symmetric about ∇f(x). Provided the stochastic gradient has a finite first moment (a standard assumption even when the variance is infinite), this yields E[g_medoid(x)] = ∇f(x) and preserves the martingale property required by the convergence analysis. We will add a short supporting lemma formalizing this symmetry argument, together with a measurable, reflection-invariant tie-breaking rule, in the revised manuscript. revision: yes
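A quick Monte Carlo sanity check of this unbiasedness claim (an illustrative sketch, not the paper's lemma; the Student-t noise, dimension, M, and trial count are assumed values chosen here):

```python
import numpy as np

rng = np.random.default_rng(1)

def medoid(G):
    # Row of G minimizing the summed Euclidean distances to the other rows.
    d = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2).sum(axis=1)
    return G[np.argmin(d)]

true_grad = np.array([1.0, -2.0])
M, trials = 7, 20000
acc = np.zeros_like(true_grad)
for _ in range(trials):
    # Symmetric heavy-tailed noise: Student-t with df = 1.5 has a finite
    # first moment but infinite variance.
    acc += medoid(true_grad + rng.standard_t(1.5, size=(M, 2)))
estimate = acc / trials
# Reflecting the noise preserves all pairwise distances, so the medoid's
# distribution is symmetric about the true gradient; the Monte Carlo mean
# should therefore land close to true_grad.
```

The check probes only unbiasedness of the selected gradient, not the rate constants, so it complements rather than replaces the formal lemma.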
Referee: [§3.2 and clipping theorem] §3.2 (Noise model) and the high-probability result with clipping: the analysis invokes symmetry to control the deviation of the medoid, but the coupling induced by the argmin over all pairwise distances may invalidate standard concentration arguments used for independent medians. A concrete counter-example or moment bound showing that the medoid deviation remains O(1/sqrt(M)) or better under the paper's noise class is required to support the claimed high-probability rate.
Authors: We acknowledge that the selection-induced dependence requires a dedicated argument rather than direct application of independent-median bounds. In the revised manuscript we will add a lemma that derives the requested moment bound on ||g_medoid(x) - ∇f(x)||. Using the symmetry of the noise and the fact that each chunk is equally likely to be selected under the exchangeable joint distribution, we bound the deviation by O(1/sqrt(M)) in expectation (and obtain matching high-probability tails when combined with clipping). This explicitly justifies the high-probability rates claimed for the clipped variant. revision: yes
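The claimed deviation behavior can at least be probed numerically. The sketch below, under an assumed Student-t noise model (df = 1.5, infinite variance), checks only that the medoid's deviation shrinks as M grows for fixed per-chunk noise; it does not resolve the selection-coupling issue the referee raises:

```python
import numpy as np

rng = np.random.default_rng(2)

def medoid(G):
    # Row of G minimizing the summed Euclidean distances to the other rows.
    d = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2).sum(axis=1)
    return G[np.argmin(d)]

def mean_deviation(M, trials=2000, dim=2):
    # Average Euclidean deviation of the medoid of M symmetric heavy-tailed
    # noise vectors (Student-t, df = 1.5) from the true mean, which is 0.
    total = 0.0
    for _ in range(trials):
        total += np.linalg.norm(medoid(rng.standard_t(1.5, size=(M, dim))))
    return total / trials
```

Observing `mean_deviation(M)` decrease across M = 3, 9, 27 is consistent with, though far from a proof of, an O(1/sqrt(M))-type bound.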
Circularity Check
No significant circularity: the rates follow from a standard analysis of the medoid estimator under the stated assumptions.
Full rationale
The derivation establishes explicit convergence rates for the time-averaged squared gradient norm by applying standard non-convex smoothness and bounded-gradient assumptions to the R-SGD-Mini update, which selects the medoid chunk gradient. The neighborhood size is characterized directly in terms of the noise distribution parameters and algorithm constants (M, K, step-size schedule), without any reduction of the target quantity to a fitted parameter, self-defined quantity, or load-bearing self-citation. The O(T^{-1}) and O(T^{-1/2}) rates are obtained via the usual telescoping argument on the descent lemma plus a bound on the deviation of the medoid estimator; these steps are independent of the final result and do not rely on renaming or smuggling an ansatz. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: gradient noise is symmetric and heavy-tailed, possibly with infinite variance.
- Domain assumption: the objective is non-convex but satisfies standard smoothness and bounded gradient assumptions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel): unclear. Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "split the K-sized data batch into M distinct data chunks... update... with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction): unclear. Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "E[||ν_j⋆||²] ≤ O(R^{-2(p-1)/p}) ... E[ν_j⋆]=0"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. Optimal high-probability convergence of nonlinear SGD under heavy-tailed noise via symmetrization, 2025. URL https://arxiv.org/abs/2507.09093
- [2] Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. High-probability convergence bounds for online nonlinear stochastic gradient descent under heavy-tailed noise. In The 28th International Conference on Artificial Intelligence and Statistics, volume 258, page 9, 2025.
- [3] Melih Barsbey, Milad Sefidgaran, Murat A. Erdogdu, Gael Richard, and Umut Simsekli. Heavy tails in SGD and compressibility of overparametrized neural networks. Advances in Neural Information Processing Systems, 34:29364–29378, 2021.
- [4] Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 4780–4788. PMLR, 2024.
- [5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
- [6] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
- [7] Xiangyi Chen, Steven Z. Wu, and Mingyi Hong. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems, 33:13773–13782, 2020.
- [8] Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 34:4883–4895, 2021.
- [9] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [10] Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.
- [11] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. In International Conference on Machine Learning, pages 3964–3975. PMLR, 2021.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [13] Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, Soummya Kar, Nemanja Milosevic, and Dusan Stamenkovic. Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise. SIAM Journal on Optimization, 33(2):394–423. doi: 10.1137/21M145896X. URL https://doi.org/10.1137/21M145896X
- [15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [16] Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2266–2290. PMLR, 2023.
- [17] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
- [18] Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. 2019.
- [19] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. 2015.
- [20] Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The fundamentals of heavy tails. 2013.
- [21] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [22] Ta Duy Nguyen, Thien H. Nguyen, Alina Ene, and Huy Nguyen. Improved convergence in high probability of clipped gradient methods with heavy tailed noise. Advances in Neural Information Processing Systems, 36:24191–24222, 2023.
- [23] Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Le Nguyen. High probability convergence of clipped-SGD under heavy-tailed noise. arXiv preprint arXiv:2302.05437, 2023.
- [24] Stefano Peluchetti, Stefano Favaro, and Sandra Fortini. Stable behaviour of infinitely wide deep neural networks. In International Conference on Artificial Intelligence and Statistics, pages 1137–1146. PMLR, 2020.
- [25] Nikita Puchkin, Eduard Gorbunov, Nickolay Kutuzov, and Alexander Gasnikov. Breaking the heavy-tailed noise barrier in stochastic optimization problems. In International Conference on Artificial Intelligence and Statistics, pages 856–864. PMLR, 2024.
- [26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [27] Erich Schubert and Peter J. Rousseeuw. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications, pages 171–187. Springer, 2019.
- [28] Erich Schubert and Peter J. Rousseeuw. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems, 101:101804, 2021.
- [30] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
- [31] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.