Stochastic Zeroth-Order Optimization Under Heavy-Tailed Noise

El Mahdi Chayti; Imane Rahali; Omar Saadi; Qiuyi Zhang; Taha El Bakkali

arxiv: 2605.17394 · v1 · pith:AEK7UWYQnew · submitted 2026-05-17 · 🧮 math.OC

Stochastic Zeroth-Order Optimization Under Heavy-Tailed Noise

Taha El Bakkali , El Mahdi Chayti , Qiuyi Zhang , Imane Rahali , Omar Saadi This is my paper

Pith reviewed 2026-05-19 23:15 UTC · model grok-4.3

classification 🧮 math.OC

keywords zeroth-order optimizationheavy-tailed noisenonconvex optimizationstochastic optimizationhigh-probability boundsclipped estimatorsfinite moments

0 comments

The pith

Clipped scalar directional estimates let zeroth-order methods find stationary points under heavy-tailed noise with near-optimal query rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that clipping each scalar directional derivative before aggregation lets a two-point zeroth-order algorithm reach an epsilon-stationary point with high probability for smooth nonconvex objectives whose sample gradients obey only a weak-L_p tail bound. A sympathetic reader would care because this removes the classical bounded-variance assumption that often fails in modern machine-learning settings while still matching the best-known first-order dependence on epsilon. The resulting sample complexity is order d to the power p over 2(p-1) times epsilon to the power minus (3p-2) over (p-1), and the same clipping idea yields a high-probability guarantee even at p=2 under conditions that permit infinite variance.

Core claim

Under sample-wise smoothness and a weak-L_p tail condition on the sample-gradient noise, the Robust Scalar-Clipped Zeroth-Order method finds an epsilon-stationary point with high probability after using order d to the p over 2(p-1) times epsilon to the minus (3p-2) over (p-1) noisy function evaluations. At p equals 2 the bound simplifies to order d epsilon to the minus 4, recovering the classical stochastic zeroth-order dependence but under weaker noise conditions and with high probability.

What carries the argument

The Robust Scalar-Clipped Zeroth-Order (RSC-ZO) method, which clips each scalar directional derivative estimate before aggregation to control heavy-tailed noise.

Load-bearing premise

The weak-L_p tail bound on the vector gradient noise transfers to the scalar directional finite-difference estimates without extra assumptions that would invalidate the high-probability guarantee.

What would settle it

A concrete smooth nonconvex function together with a noise distribution obeying the weak-L_p condition for which the clipped directional estimates still produce tails too heavy for the stated convergence rate to hold.

Figures

Figures reproduced from arXiv: 2605.17394 by El Mahdi Chayti, Imane Rahali, Omar Saadi, Qiuyi Zhang, Taha El Bakkali.

**Figure 2.** Figure 2: Tuning curves for the representative setting [PITH_FULL_IMAGE:figures/full_fig_p036_2.png] view at source ↗

**Figure 3.** Figure 3: Representative optimization curves for d = 100, p = 1.5. We plot ∥∇f(xt)∥2 = ∥xt∥2 versus the number of iterations. Raw zeroth-order optimization is unstable under heavy-tailed noise. Final-vector clipping stabilizes the method but plateaus above the target. Scalar clipping before aggregation reaches the target and obtains the lowest final stationarity measure. Representative optimization behavior [PITH_F… view at source ↗

**Figure 4.** Figure 4: Direction diagnostic for d = 100, p = 1.5. We report the cosine similarity between each stochastic estimator and the true gradient. Final-vector clipping has essentially the same alignment as the raw estimator because it preserves the raw direction. Scalar clipping before aggregation yields much better alignment, explaining its improved optimization performance [PITH_FULL_IMAGE:figures/full_fig_p038_4.png] view at source ↗

**Figure 5.** Figure 5: Dimension sweep for Pareto tail exponent [PITH_FULL_IMAGE:figures/full_fig_p038_5.png] view at source ↗

**Figure 6.** Figure 6: Tail-index sweep for d = 100. Smaller p corresponds to heavier-tailed noise. Scalar clipping gives the largest improvement in the infinite-variance regime p < 2. We also include p = 2.5 as a finite-variance sanity check, where the gap becomes smaller but scalar clipping remains competitive [PITH_FULL_IMAGE:figures/full_fig_p039_6.png] view at source ↗

**Figure 7.** Figure 7: Outlier dominance diagnostic for the representative setting. We plot the distribution of [PITH_FULL_IMAGE:figures/full_fig_p039_7.png] view at source ↗

read the original abstract

We study stochastic zeroth-order (ZO) optimization of smooth nonconvex objectives under heavy-tailed sample-gradient noise. This regime is motivated by empirical evidence that gradient noise in modern machine learning can violate the bounded-variance assumptions used in classical ZO theory. While first-order methods have optimal rates under bounded $p$-th moment noise for $p\in(1,2]$, analogous high-probability guarantees for nonconvex ZO methods are much less understood. The ZO setting is not a direct corollary of first-order theory. First-order methods observe stochastic gradients, whereas derivative-free methods only query noisy function values and build finite-difference estimates. Thus, weak-$L_p$ control of $\nabla F(x;\xi)-\nabla f(x)$ must first be transferred to scalar directional estimates. We propose the Robust Scalar-Clipped Zeroth-Order method (RSC-ZO), a two-point method that clips each scalar directional derivative before aggregation. Under sample-wise smoothness and a weak-$L_p$ tail condition on the sample-gradient noise, RSC-ZO finds an $\varepsilon$-stationary point with high probability using $$ \widetilde{O}\!\left( d^{\frac{p}{2(p-1)}}\varepsilon^{-\frac{3p-2}{p-1}} \right) $$ noisy function evaluations. This matches the optimal first-order $\varepsilon$-dependence. At $p=2$, the bound becomes $\widetilde{O}(d\varepsilon^{-4})$, matching the classical stochastic ZO dimension--accuracy dependence, but with a high-probability guarantee and under a weaker weak-$L_2$ condition that can allow infinite variance. We also analyze a momentum variant and quantify its batch-size/stepsize tradeoff.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first high-probability nonconvex ZO rates under heavy-tailed noise that keep the first-order epsilon dependence, via a clipped scalar estimator.

read the letter

This paper gives high-probability rates for nonconvex zeroth-order stochastic optimization when gradient noise has only p-th moments for p in (1,2]. The main point is that they match the epsilon dependence of first-order methods while staying derivative-free. They introduce RSC-ZO, which clips each scalar directional finite-difference before aggregation, and prove a transfer of the weak-L_p tail control from the sample gradients to these estimates. The resulting bound is Õ(d^{p/2(p-1)} ε^{-(3p-2)/(p-1)}) noisy queries, and at p=2 it recovers the usual d ε^{-4} but now with high probability and under a condition that permits infinite variance. They also analyze a momentum variant and the batch-size versus step-size tradeoff. This is useful for ML workloads where variance assumptions often fail. The construction is direct and the assumptions are stated plainly. The soft spot is the transfer step itself. The abstract flags it as nontrivial, and there is a real question whether weak-L_p control on the vector noise carries over to the clipped scalar estimates without extra tail conditions on the function values or uniform control over the sphere. If the proof uses smoothness to bound the difference but then applies Markov-type inequalities without properly handling the union bound across random directions and iterations, the failure probability could exceed δ when p is small and d is large. That would weaken the claimed dimension factor. Since the abstract gives no sketch, the constants and the exact probability control are hard to check from the summary alone. The work is for people working on robust stochastic optimization in nonconvex settings with realistic noise. A reader focused on heavy-tailed analysis or zeroth-order methods would get concrete rates and a new estimator to build on. It deserves a serious referee because the result is new, the setting is relevant, and the central claim is stated sharply enough to be checked.

Referee Report

2 major / 3 minor

Summary. The paper introduces the Robust Scalar-Clipped Zeroth-Order (RSC-ZO) method for stochastic zeroth-order optimization of smooth nonconvex functions under heavy-tailed sample-gradient noise. Under sample-wise smoothness and a weak-L_p tail condition on the noise for p in (1,2], it claims that RSC-ZO finds an ε-stationary point with high probability using Õ(d^{p/2(p-1)} ε^{-(3p-2)/(p-1)}) noisy function evaluations. The key step is transferring the weak-L_p control from vector gradients to clipped scalar directional finite-difference estimates; a momentum variant is also analyzed, and the p=2 case recovers the classical dimension dependence under a weaker assumption allowing infinite variance.

Significance. If the transfer of tail bounds holds with the stated high-probability control, the result supplies the first explicit high-probability ZO rates under heavy-tailed noise that match the optimal first-order ε-dependence. This is valuable for ML applications where gradient noise often violates bounded-variance assumptions. The explicit handling of the nontrivial transfer from vector to scalar estimators, the use of clipping for robustness, and the momentum analysis are concrete strengths. The paper provides a clear statement of the rate and assumptions, supporting potential reproducibility.

major comments (2)

[§4.2, Lemma 4.1] §4.2, the transfer argument (Lemma 4.1): the high-probability bound on the clipped scalar directional estimator is obtained by applying a Markov-type inequality after using smoothness to control the bias from the true directional derivative. It is not immediate that this yields a uniform tail bound over the random direction u drawn from the sphere and over the T iterations without an extra union-bound factor that would alter the claimed d^{p/2(p-1)} dependence when p<2; please supply the precise failure-probability calculation and any hidden logarithmic terms.
[Theorem 4.3] Theorem 4.3 (main high-probability guarantee): the overall sample complexity absorbs the iteration count T and the per-iteration batch size, but the proof sketch in the text does not explicitly verify that the per-iteration failure probability δ/T remains compatible with the weak-L_p tail when the clipping threshold is chosen independently of d. If the constant in the tail bound grows with dimension, the Õ notation may conceal a dependence that weakens the dimension factor.

minor comments (3)

[§2] Notation: the weak-L_p condition is stated on the sample gradient but the clipping is applied to the scalar estimate; a short remark clarifying whether the same p applies directly or requires adjustment would improve readability.
[Figure 1] Figure 1 (convergence plots): the y-axis scale and the number of independent runs used for the shaded regions should be stated explicitly so that the empirical support for the high-probability claim can be assessed.
[§1] References: the discussion of first-order heavy-tailed methods should cite the optimal rates of [specific recent works on p-moment SGD] to make the comparison to the ZO rate fully transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable comments on the high-probability analysis of RSC-ZO. We address each major comment below with clarifications on the failure probabilities, union bounds, and dimension dependence. We will revise the manuscript to include the requested explicit calculations while preserving the claimed rates.

read point-by-point responses

Referee: [§4.2, Lemma 4.1] §4.2, the transfer argument (Lemma 4.1): the high-probability bound on the clipped scalar directional estimator is obtained by applying a Markov-type inequality after using smoothness to control the bias from the true directional derivative. It is not immediate that this yields a uniform tail bound over the random direction u drawn from the sphere and over the T iterations without an extra union-bound factor that would alter the claimed d^{p/2(p-1)} dependence when p<2; please supply the precise failure-probability calculation and any hidden logarithmic terms.

Authors: We thank the referee for highlighting the need for a precise accounting. In the proof of Lemma 4.1, the Markov-type tail bound is established conditionally on a fixed iteration, with u drawn uniformly from the sphere; the smoothness assumption controls the bias term uniformly in u. To obtain the simultaneous high-probability statement over all T iterations, we apply a standard union bound by targeting a per-iteration failure probability of δ/T. The resulting tail inequality takes the form P(|clipped scalar estimator − true directional derivative| > t) ≤ C ⋅ (weak-L_p constant) / t^p, where C is independent of both d and the particular realization of u. Setting the clipping threshold to balance bias and tail probability ensures the per-iteration failure probability is at most δ/T. Because each u is sampled independently at its own iteration, no covering argument or uniform bound over the entire sphere is required; the union bound is taken only over the T realized random directions. Consequently, only logarithmic factors in T (and 1/δ) appear. Since T scales polynomially with d^{p/2(p-1)} and ε^{-(3p-2)/(p-1)}, these logs are absorbed in the Õ notation and do not alter the leading exponent of d. We will insert this explicit failure-probability calculation into the revised §4.2. revision: yes
Referee: [Theorem 4.3] Theorem 4.3 (main high-probability guarantee): the overall sample complexity absorbs the iteration count T and the per-iteration batch size, but the proof sketch in the text does not explicitly verify that the per-iteration failure probability δ/T remains compatible with the weak-L_p tail when the clipping threshold is chosen independently of d. If the constant in the tail bound grows with dimension, the Õ notation may conceal a dependence that weakens the dimension factor.

Authors: We agree that an explicit verification strengthens the presentation. In the proof of Theorem 4.3 the clipping threshold is chosen solely from the weak-L_p parameter of the vector noise and the target accuracy ε; it does not depend on d. The transfer from vector to scalar directional estimator (via the inner product with a unit vector u) produces a tail constant that is a fixed multiple of the original weak-L_p constant and introduces no additional polynomial dependence on d beyond the factor already present in the ZO estimator variance, which is accounted for in the d^{p/2(p-1)} term. With this constant fixed, the Markov bound remains compatible with the per-iteration failure probability δ/T for any T that is polynomial in the problem parameters. Any d-independent multiplicative constants are absorbed into the Õ notation without changing the leading dimension exponent. We will expand the proof sketch in the revised manuscript to display this compatibility explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained under explicit assumptions

full rationale

The paper derives high-probability convergence for RSC-ZO from sample-wise smoothness plus a weak-L_p tail condition on sample gradients, with the transfer to clipped directional finite differences presented as a nontrivial technical step rather than a definitional equivalence. No equations reduce a claimed rate or prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled through prior work. The bound is obtained by direct analysis of the clipped estimator under the stated tail control, matching first-order dependence only as a comparison, not as a reduction. The derivation therefore stands independently of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on sample-wise smoothness (to control finite-difference bias) and the weak-L_p tail condition on gradient noise; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption sample-wise smoothness of the objective
Required to bound the bias of the two-point finite-difference estimator before clipping.
domain assumption weak-L_p tail bound on sample-gradient noise for p in (1,2]
Central noise model that replaces bounded-variance assumption; enables the clipping analysis.

pith-pipeline@v0.9.0 · 5856 in / 1255 out tokens · 44326 ms · 2026-05-19T23:15:27.794078+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

weak-Lp control of ∇F(x;ξ)−∇f(x) must first be transferred to scalar directional estimates... RSC-ZO... clips each scalar directional derivative before aggregation
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

high-probability guarantee... Õ(d^{p/2(p-1)} ε^{-(3p-2)/(p-1)})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

[1]

SIAM journal on optimization , volume=

Stochastic first-and zeroth-order methods for nonconvex stochastic programming , author=. SIAM journal on optimization , volume=. 2013 , publisher=

work page 2013
[2]

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =

An Empirical Bernstein Inequality for Dependent Data in Hilbert Spaces and Applications , author =. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =. 2025 , editor =

work page 2025
[3]

Foundations of Computational Mathematics , volume =

Nesterov, Yurii and Spokoiny, Vladimir , title =. Foundations of Computational Mathematics , volume =. 2017 , publisher =

work page 2017
[4]

and Jordan, Michael I

Duchi, John C. and Jordan, Michael I. and Wainwright, Martin J. and Wibisono, Andre , title =. IEEE Transactions on Information Theory , volume =. 2015 , doi =

work page 2015
[5]

Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) , pages =

Chen, Pin-Yu and Zhang, Huan and Sharma, Yash and Yi, Jinfeng and Hsieh, Cho-Jui , title =. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) , pages =

work page
[6]

Proceedings of the 35th International Conference on Machine Learning (ICML) , year =

Ilyas, Andrew and Engstrom, Logan and Athalye, Anish and Lin, Jessy , title =. Proceedings of the 35th International Conference on Machine Learning (ICML) , year =

work page
[7]

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Salimans, Tim and Ho, Jonathan and Chen, Xi and Sidor, Szymon and Sutskever, Ilya , title =. arXiv preprint arXiv:1703.03864 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[8]

and Kumar, Sanjiv and Sra, Suvrit , title =

Zhang, Jingzhao and Karimireddy, Sai Praneeth and Veit, Andreas and Kim, Seungyeon and Reddi, Sashank J. and Kumar, Sanjiv and Sra, Suvrit , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

work page
[9]

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks , booktitle =

work page
[10]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Gorbunov, Eduard and Danilova, Marina and Gasnikov, Alexander , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

work page
[11]

From gradient clipping to normalization for heavy tailed sgd,

H. From Gradient Clipping to Normalization for Heavy Tailed. arXiv preprint arXiv:2410.13849 , year =

work page arXiv
[12]

From Gradient Clipping to Normalization for Heavy Tailed

H. From Gradient Clipping to Normalization for Heavy Tailed. International Conference on Artificial Intelligence and Statistics (AISTATS) , year=

work page
[13]

arXiv preprint arXiv:2402.02461 , year =

Kornilov, Nikita and Dorn, Yuriy and Lobanov, Aleksandr and Kutuzov, Nikolay and Shibaev, Innokentiy and Gorbunov, Eduard and Gasnikov, Alexander and Nazin, Alexander , title =. arXiv preprint arXiv:2402.02461 , year =

work page arXiv
[14]

Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance , booktitle =

Kornilov, Nikita and Shamir, Ohad and Lobanov, Aleksandr and Dvinskikh, Darina and Gasnikov, Alexander and Shibaev, Innokentiy and Gorbunov, Eduard and Horv. Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance , booktitle =

work page
[15]

and Pontil, Massimiliano , title =

Mirzaei, Erfan and Maurer, Andreas and Kostic, Vladimir R. and Pontil, Massimiliano , title =. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2025 , publisher =

work page 2025
[16]

Journal of Machine Learning Research , volume =

Shamir, Ohad , title =. Journal of Machine Learning Research , volume =

work page
[17]

arXiv preprint arXiv:2510.15610 , year =

Chayti, El Mahdi and El Bakkali El Kadi, Taha and Saadi, Omar and Jaggi, Martin , title =. arXiv preprint arXiv:2510.15610 , year =

work page arXiv
[18]

2025 , journal=

Improving Stochastic Cubic Newton with Momentum , author=. 2025 , journal=

work page 2025
[19]

and Chen, Danqi and Arora, Sanjeev , title =

Malladi, Sadhika and Gao, Tianyu and Nichani, Eshaan and Damian, Alex and Lee, Jason D. and Chen, Danqi and Arora, Sanjeev , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

work page
[20]

Zico and Lipton, Zachary C

Garg, Saurabh and Zhanson, Joshua and Parisotto, Emilio and Prasad, Adarsh and Kolter, J. Zico and Lipton, Zachary C. and Balakrishnan, Sivaraman and Salakhutdinov, Ruslan and Ravikumar, Pradeep , title =. Proceedings of the 38th International Conference on Machine Learning (ICML) , year =

work page
[21]

2026 , journal=

RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization , author=. 2026 , journal=

work page 2026
[22]

Nazin, A. V. and Nemirovsky, A. S. and Tsybakov, A. B. and Juditsky, A. B. , title =. Automation and Remote Control , volume =

work page
[23]

High-Probability Bounds for Stochastic Optimization and Variational Inequalities: The Case of Unbounded Variance , booktitle =

Sadiev, Abdurakhmon and Danilova, Marina and Gorbunov, Eduard and Horv. High-Probability Bounds for Stochastic Optimization and Variational Inequalities: The Case of Unbounded Variance , booktitle =

work page
[24]

Proceedings of the 36th Annual Conference on Learning Theory (COLT) , year =

Liu, Zijian and Zhang, Jiawei and Zhou, Zhengyuan , title =. Proceedings of the 36th Annual Conference on Learning Theory (COLT) , year =

work page
[25]

Online convex optimization in the bandit setting: gradient descent without a gradient

Online convex optimization in the bandit setting: gradient descent without a gradient , author=. arXiv preprint cs/0408007 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

International Conference on Learning Representations (ICLR) , year =

Liu, Zijian and Zhou, Zhengyuan , title =. International Conference on Learning Representations (ICLR) , year =

work page
[27]

Computational Management Science , volume =

Kornilov, Nikita and Gasnikov, Alexander and Dvurechensky, Pavel and Dvinskikh, Darina , title =. Computational Management Science , volume =

work page
[28]

Advances in Neural Information Processing Systems , volume=

High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=

work page
[29]

Advances in Neural Information Processing Systems , volume=

Improved convergence in high probability of clipped gradient methods with heavy tailed noise , author=. Advances in Neural Information Processing Systems , volume=

work page
[30]

2009 , publisher=

Introduction to derivative-free optimization , author=. 2009 , publisher=

work page 2009
[31]

2015 , publisher=

Handbook of simulation optimization , author=. 2015 , publisher=

work page 2015
[32]

Journal of Machine Learning Research , volume=

Random search for hyper-parameter optimization , author=. Journal of Machine Learning Research , volume=

work page
[33]

Advances in Neural Information Processing Systems , volume=

Practical bayesian optimization of machine learning algorithms , author=. Advances in Neural Information Processing Systems , volume=

work page
[34]

International Conference on Learning Representations , year=

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. International Conference on Learning Representations , year=

work page
[35]

USSR Computational Mathematics and Mathematical Physics , volume=

Some methods of speeding up the convergence of iteration methods , author=. USSR Computational Mathematics and Mathematical Physics , volume=. 1964 , publisher=

work page 1964
[36]

Advances in Neural Information Processing Systems , volume=

Momentum-based variance reduction in non-convex SGD , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , year =

Building a question answering test collection , author =. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , year =

work page
[38]

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , year =

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , author =. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , year =

work page 2018
[39]

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages =

A Large Annotated Corpus for Learning Natural Language Inference , author =. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2015
[40]

Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin , journal =

work page
[41]

Making Pre-Trained Language Models Better Few-Shot Learners , author =. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages =

work page

[1] [1]

SIAM journal on optimization , volume=

Stochastic first-and zeroth-order methods for nonconvex stochastic programming , author=. SIAM journal on optimization , volume=. 2013 , publisher=

work page 2013

[2] [2]

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =

An Empirical Bernstein Inequality for Dependent Data in Hilbert Spaces and Applications , author =. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =. 2025 , editor =

work page 2025

[3] [3]

Foundations of Computational Mathematics , volume =

Nesterov, Yurii and Spokoiny, Vladimir , title =. Foundations of Computational Mathematics , volume =. 2017 , publisher =

work page 2017

[4] [4]

and Jordan, Michael I

Duchi, John C. and Jordan, Michael I. and Wainwright, Martin J. and Wibisono, Andre , title =. IEEE Transactions on Information Theory , volume =. 2015 , doi =

work page 2015

[5] [5]

Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) , pages =

Chen, Pin-Yu and Zhang, Huan and Sharma, Yash and Yi, Jinfeng and Hsieh, Cho-Jui , title =. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) , pages =

work page

[6] [6]

Proceedings of the 35th International Conference on Machine Learning (ICML) , year =

Ilyas, Andrew and Engstrom, Logan and Athalye, Anish and Lin, Jessy , title =. Proceedings of the 35th International Conference on Machine Learning (ICML) , year =

work page

[7] [7]

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Salimans, Tim and Ho, Jonathan and Chen, Xi and Sidor, Szymon and Sutskever, Ilya , title =. arXiv preprint arXiv:1703.03864 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

and Kumar, Sanjiv and Sra, Suvrit , title =

Zhang, Jingzhao and Karimireddy, Sai Praneeth and Veit, Andreas and Kim, Seungyeon and Reddi, Sashank J. and Kumar, Sanjiv and Sra, Suvrit , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

work page

[9] [9]

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks , booktitle =

work page

[10] [10]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Gorbunov, Eduard and Danilova, Marina and Gasnikov, Alexander , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

work page

[11] [11]

From gradient clipping to normalization for heavy tailed sgd,

H. From Gradient Clipping to Normalization for Heavy Tailed. arXiv preprint arXiv:2410.13849 , year =

work page arXiv

[12] [12]

From Gradient Clipping to Normalization for Heavy Tailed

H. From Gradient Clipping to Normalization for Heavy Tailed. International Conference on Artificial Intelligence and Statistics (AISTATS) , year=

work page

[13] [13]

arXiv preprint arXiv:2402.02461 , year =

Kornilov, Nikita and Dorn, Yuriy and Lobanov, Aleksandr and Kutuzov, Nikolay and Shibaev, Innokentiy and Gorbunov, Eduard and Gasnikov, Alexander and Nazin, Alexander , title =. arXiv preprint arXiv:2402.02461 , year =

work page arXiv

[14] [14]

Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance , booktitle =

Kornilov, Nikita and Shamir, Ohad and Lobanov, Aleksandr and Dvinskikh, Darina and Gasnikov, Alexander and Shibaev, Innokentiy and Gorbunov, Eduard and Horv. Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance , booktitle =

work page

[15] [15]

and Pontil, Massimiliano , title =

Mirzaei, Erfan and Maurer, Andreas and Kostic, Vladimir R. and Pontil, Massimiliano , title =. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2025 , publisher =

work page 2025

[16] [16]

Journal of Machine Learning Research , volume =

Shamir, Ohad , title =. Journal of Machine Learning Research , volume =

work page

[17] [17]

arXiv preprint arXiv:2510.15610 , year =

Chayti, El Mahdi and El Bakkali El Kadi, Taha and Saadi, Omar and Jaggi, Martin , title =. arXiv preprint arXiv:2510.15610 , year =

work page arXiv

[18] [18]

2025 , journal=

Improving Stochastic Cubic Newton with Momentum , author=. 2025 , journal=

work page 2025

[19] [19]

and Chen, Danqi and Arora, Sanjeev , title =

Malladi, Sadhika and Gao, Tianyu and Nichani, Eshaan and Damian, Alex and Lee, Jason D. and Chen, Danqi and Arora, Sanjeev , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

work page

[20] [20]

Zico and Lipton, Zachary C

Garg, Saurabh and Zhanson, Joshua and Parisotto, Emilio and Prasad, Adarsh and Kolter, J. Zico and Lipton, Zachary C. and Balakrishnan, Sivaraman and Salakhutdinov, Ruslan and Ravikumar, Pradeep , title =. Proceedings of the 38th International Conference on Machine Learning (ICML) , year =

work page

[21] [21]

2026 , journal=

RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization , author=. 2026 , journal=

work page 2026

[22] [22]

Nazin, A. V. and Nemirovsky, A. S. and Tsybakov, A. B. and Juditsky, A. B. , title =. Automation and Remote Control , volume =

work page

[23] [23]

High-Probability Bounds for Stochastic Optimization and Variational Inequalities: The Case of Unbounded Variance , booktitle =

Sadiev, Abdurakhmon and Danilova, Marina and Gorbunov, Eduard and Horv. High-Probability Bounds for Stochastic Optimization and Variational Inequalities: The Case of Unbounded Variance , booktitle =

work page

[24] [24]

Proceedings of the 36th Annual Conference on Learning Theory (COLT) , year =

Liu, Zijian and Zhang, Jiawei and Zhou, Zhengyuan , title =. Proceedings of the 36th Annual Conference on Learning Theory (COLT) , year =

work page

[25] [25]

Online convex optimization in the bandit setting: gradient descent without a gradient

Online convex optimization in the bandit setting: gradient descent without a gradient , author=. arXiv preprint cs/0408007 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

International Conference on Learning Representations (ICLR) , year =

Liu, Zijian and Zhou, Zhengyuan , title =. International Conference on Learning Representations (ICLR) , year =

work page

[27] [27]

Computational Management Science , volume =

Kornilov, Nikita and Gasnikov, Alexander and Dvurechensky, Pavel and Dvinskikh, Darina , title =. Computational Management Science , volume =

work page

[28] [28]

Advances in Neural Information Processing Systems , volume=

High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=

work page

[29] [29]

Advances in Neural Information Processing Systems , volume=

Improved convergence in high probability of clipped gradient methods with heavy tailed noise , author=. Advances in Neural Information Processing Systems , volume=

work page

[30] [30]

2009 , publisher=

Introduction to derivative-free optimization , author=. 2009 , publisher=

work page 2009

[31] [31]

2015 , publisher=

Handbook of simulation optimization , author=. 2015 , publisher=

work page 2015

[32] [32]

Journal of Machine Learning Research , volume=

Random search for hyper-parameter optimization , author=. Journal of Machine Learning Research , volume=

work page

[33] [33]

Advances in Neural Information Processing Systems , volume=

Practical bayesian optimization of machine learning algorithms , author=. Advances in Neural Information Processing Systems , volume=

work page

[34] [34]

International Conference on Learning Representations , year=

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. International Conference on Learning Representations , year=

work page

[35] [35]

USSR Computational Mathematics and Mathematical Physics , volume=

Some methods of speeding up the convergence of iteration methods , author=. USSR Computational Mathematics and Mathematical Physics , volume=. 1964 , publisher=

work page 1964

[36] [36]

Advances in Neural Information Processing Systems , volume=

Momentum-based variance reduction in non-convex SGD , author=. Advances in Neural Information Processing Systems , volume=

work page

[37] [37]

Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , year =

Building a question answering test collection , author =. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , year =

work page

[38] [38]

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , year =

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , author =. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , year =

work page 2018

[39] [39]

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages =

A Large Annotated Corpus for Learning Natural Language Inference , author =. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2015

[40] [40]

Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin , journal =

work page

[41] [41]

Making Pre-Trained Language Models Better Few-Shot Learners , author =. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages =

work page