Stochastic Zeroth-Order Optimization Under Heavy-Tailed Noise
Pith reviewed 2026-05-19 23:15 UTC · model grok-4.3
The pith
Clipped scalar directional estimates let zeroth-order methods find stationary points under heavy-tailed noise with near-optimal query rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under sample-wise smoothness and a weak-L_p tail condition on the sample-gradient noise, the Robust Scalar-Clipped Zeroth-Order method finds an epsilon-stationary point with high probability after using order d to the p over 2(p-1) times epsilon to the minus (3p-2) over (p-1) noisy function evaluations. At p equals 2 the bound simplifies to order d epsilon to the minus 4, recovering the classical stochastic zeroth-order dependence but under weaker noise conditions and with high probability.
What carries the argument
The Robust Scalar-Clipped Zeroth-Order (RSC-ZO) method, which clips each scalar directional derivative estimate before aggregation to control heavy-tailed noise.
Load-bearing premise
The weak-L_p tail bound on the vector gradient noise transfers to the scalar directional finite-difference estimates without extra assumptions that would invalidate the high-probability guarantee.
What would settle it
A concrete smooth nonconvex function together with a noise distribution obeying the weak-L_p condition for which the clipped directional estimates still produce tails too heavy for the stated convergence rate to hold.
Figures
read the original abstract
We study stochastic zeroth-order (ZO) optimization of smooth nonconvex objectives under heavy-tailed sample-gradient noise. This regime is motivated by empirical evidence that gradient noise in modern machine learning can violate the bounded-variance assumptions used in classical ZO theory. While first-order methods have optimal rates under bounded $p$-th moment noise for $p\in(1,2]$, analogous high-probability guarantees for nonconvex ZO methods are much less understood. The ZO setting is not a direct corollary of first-order theory. First-order methods observe stochastic gradients, whereas derivative-free methods only query noisy function values and build finite-difference estimates. Thus, weak-$L_p$ control of $\nabla F(x;\xi)-\nabla f(x)$ must first be transferred to scalar directional estimates. We propose the Robust Scalar-Clipped Zeroth-Order method (RSC-ZO), a two-point method that clips each scalar directional derivative before aggregation. Under sample-wise smoothness and a weak-$L_p$ tail condition on the sample-gradient noise, RSC-ZO finds an $\varepsilon$-stationary point with high probability using $$ \widetilde{O}\!\left( d^{\frac{p}{2(p-1)}}\varepsilon^{-\frac{3p-2}{p-1}} \right) $$ noisy function evaluations. This matches the optimal first-order $\varepsilon$-dependence. At $p=2$, the bound becomes $\widetilde{O}(d\varepsilon^{-4})$, matching the classical stochastic ZO dimension--accuracy dependence, but with a high-probability guarantee and under a weaker weak-$L_2$ condition that can allow infinite variance. We also analyze a momentum variant and quantify its batch-size/stepsize tradeoff.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Robust Scalar-Clipped Zeroth-Order (RSC-ZO) method for stochastic zeroth-order optimization of smooth nonconvex functions under heavy-tailed sample-gradient noise. Under sample-wise smoothness and a weak-L_p tail condition on the noise for p in (1,2], it claims that RSC-ZO finds an ε-stationary point with high probability using Õ(d^{p/2(p-1)} ε^{-(3p-2)/(p-1)}) noisy function evaluations. The key step is transferring the weak-L_p control from vector gradients to clipped scalar directional finite-difference estimates; a momentum variant is also analyzed, and the p=2 case recovers the classical dimension dependence under a weaker assumption allowing infinite variance.
Significance. If the transfer of tail bounds holds with the stated high-probability control, the result supplies the first explicit high-probability ZO rates under heavy-tailed noise that match the optimal first-order ε-dependence. This is valuable for ML applications where gradient noise often violates bounded-variance assumptions. The explicit handling of the nontrivial transfer from vector to scalar estimators, the use of clipping for robustness, and the momentum analysis are concrete strengths. The paper provides a clear statement of the rate and assumptions, supporting potential reproducibility.
major comments (2)
- [§4.2, Lemma 4.1] §4.2, the transfer argument (Lemma 4.1): the high-probability bound on the clipped scalar directional estimator is obtained by applying a Markov-type inequality after using smoothness to control the bias from the true directional derivative. It is not immediate that this yields a uniform tail bound over the random direction u drawn from the sphere and over the T iterations without an extra union-bound factor that would alter the claimed d^{p/2(p-1)} dependence when p<2; please supply the precise failure-probability calculation and any hidden logarithmic terms.
- [Theorem 4.3] Theorem 4.3 (main high-probability guarantee): the overall sample complexity absorbs the iteration count T and the per-iteration batch size, but the proof sketch in the text does not explicitly verify that the per-iteration failure probability δ/T remains compatible with the weak-L_p tail when the clipping threshold is chosen independently of d. If the constant in the tail bound grows with dimension, the Õ notation may conceal a dependence that weakens the dimension factor.
minor comments (3)
- [§2] Notation: the weak-L_p condition is stated on the sample gradient but the clipping is applied to the scalar estimate; a short remark clarifying whether the same p applies directly or requires adjustment would improve readability.
- [Figure 1] Figure 1 (convergence plots): the y-axis scale and the number of independent runs used for the shaded regions should be stated explicitly so that the empirical support for the high-probability claim can be assessed.
- [§1] References: the discussion of first-order heavy-tailed methods should cite the optimal rates of [specific recent works on p-moment SGD] to make the comparison to the ZO rate fully transparent.
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable comments on the high-probability analysis of RSC-ZO. We address each major comment below with clarifications on the failure probabilities, union bounds, and dimension dependence. We will revise the manuscript to include the requested explicit calculations while preserving the claimed rates.
read point-by-point responses
-
Referee: [§4.2, Lemma 4.1] §4.2, the transfer argument (Lemma 4.1): the high-probability bound on the clipped scalar directional estimator is obtained by applying a Markov-type inequality after using smoothness to control the bias from the true directional derivative. It is not immediate that this yields a uniform tail bound over the random direction u drawn from the sphere and over the T iterations without an extra union-bound factor that would alter the claimed d^{p/2(p-1)} dependence when p<2; please supply the precise failure-probability calculation and any hidden logarithmic terms.
Authors: We thank the referee for highlighting the need for a precise accounting. In the proof of Lemma 4.1, the Markov-type tail bound is established conditionally on a fixed iteration, with u drawn uniformly from the sphere; the smoothness assumption controls the bias term uniformly in u. To obtain the simultaneous high-probability statement over all T iterations, we apply a standard union bound by targeting a per-iteration failure probability of δ/T. The resulting tail inequality takes the form P(|clipped scalar estimator − true directional derivative| > t) ≤ C ⋅ (weak-L_p constant) / t^p, where C is independent of both d and the particular realization of u. Setting the clipping threshold to balance bias and tail probability ensures the per-iteration failure probability is at most δ/T. Because each u is sampled independently at its own iteration, no covering argument or uniform bound over the entire sphere is required; the union bound is taken only over the T realized random directions. Consequently, only logarithmic factors in T (and 1/δ) appear. Since T scales polynomially with d^{p/2(p-1)} and ε^{-(3p-2)/(p-1)}, these logs are absorbed in the Õ notation and do not alter the leading exponent of d. We will insert this explicit failure-probability calculation into the revised §4.2. revision: yes
-
Referee: [Theorem 4.3] Theorem 4.3 (main high-probability guarantee): the overall sample complexity absorbs the iteration count T and the per-iteration batch size, but the proof sketch in the text does not explicitly verify that the per-iteration failure probability δ/T remains compatible with the weak-L_p tail when the clipping threshold is chosen independently of d. If the constant in the tail bound grows with dimension, the Õ notation may conceal a dependence that weakens the dimension factor.
Authors: We agree that an explicit verification strengthens the presentation. In the proof of Theorem 4.3 the clipping threshold is chosen solely from the weak-L_p parameter of the vector noise and the target accuracy ε; it does not depend on d. The transfer from vector to scalar directional estimator (via the inner product with a unit vector u) produces a tail constant that is a fixed multiple of the original weak-L_p constant and introduces no additional polynomial dependence on d beyond the factor already present in the ZO estimator variance, which is accounted for in the d^{p/2(p-1)} term. With this constant fixed, the Markov bound remains compatible with the per-iteration failure probability δ/T for any T that is polynomial in the problem parameters. Any d-independent multiplicative constants are absorbed into the Õ notation without changing the leading dimension exponent. We will expand the proof sketch in the revised manuscript to display this compatibility explicitly. revision: yes
Circularity Check
No circularity: derivation is self-contained under explicit assumptions
full rationale
The paper derives high-probability convergence for RSC-ZO from sample-wise smoothness plus a weak-L_p tail condition on sample gradients, with the transfer to clipped directional finite differences presented as a nontrivial technical step rather than a definitional equivalence. No equations reduce a claimed rate or prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled through prior work. The bound is obtained by direct analysis of the clipped estimator under the stated tail control, matching first-order dependence only as a comparison, not as a reduction. The derivation therefore stands independently of its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption sample-wise smoothness of the objective
- domain assumption weak-L_p tail bound on sample-gradient noise for p in (1,2]
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
weak-Lp control of ∇F(x;ξ)−∇f(x) must first be transferred to scalar directional estimates... RSC-ZO... clips each scalar directional derivative before aggregation
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
high-probability guarantee... Õ(d^{p/2(p-1)} ε^{-(3p-2)/(p-1)})
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
SIAM journal on optimization , volume=
Stochastic first-and zeroth-order methods for nonconvex stochastic programming , author=. SIAM journal on optimization , volume=. 2013 , publisher=
work page 2013
-
[2]
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =
An Empirical Bernstein Inequality for Dependent Data in Hilbert Spaces and Applications , author =. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =. 2025 , editor =
work page 2025
-
[3]
Foundations of Computational Mathematics , volume =
Nesterov, Yurii and Spokoiny, Vladimir , title =. Foundations of Computational Mathematics , volume =. 2017 , publisher =
work page 2017
-
[4]
Duchi, John C. and Jordan, Michael I. and Wainwright, Martin J. and Wibisono, Andre , title =. IEEE Transactions on Information Theory , volume =. 2015 , doi =
work page 2015
-
[5]
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) , pages =
Chen, Pin-Yu and Zhang, Huan and Sharma, Yash and Yi, Jinfeng and Hsieh, Cho-Jui , title =. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) , pages =
-
[6]
Proceedings of the 35th International Conference on Machine Learning (ICML) , year =
Ilyas, Andrew and Engstrom, Logan and Athalye, Anish and Lin, Jessy , title =. Proceedings of the 35th International Conference on Machine Learning (ICML) , year =
-
[7]
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Salimans, Tim and Ho, Jonathan and Chen, Xi and Sidor, Szymon and Sutskever, Ilya , title =. arXiv preprint arXiv:1703.03864 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
and Kumar, Sanjiv and Sra, Suvrit , title =
Zhang, Jingzhao and Karimireddy, Sai Praneeth and Veit, Andreas and Kim, Seungyeon and Reddi, Sashank J. and Kumar, Sanjiv and Sra, Suvrit , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =
-
[9]
A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks , booktitle =
-
[10]
Advances in Neural Information Processing Systems (NeurIPS) , volume =
Gorbunov, Eduard and Danilova, Marina and Gasnikov, Alexander , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =
-
[11]
From gradient clipping to normalization for heavy tailed sgd,
H. From Gradient Clipping to Normalization for Heavy Tailed. arXiv preprint arXiv:2410.13849 , year =
-
[12]
From Gradient Clipping to Normalization for Heavy Tailed
H. From Gradient Clipping to Normalization for Heavy Tailed. International Conference on Artificial Intelligence and Statistics (AISTATS) , year=
-
[13]
arXiv preprint arXiv:2402.02461 , year =
Kornilov, Nikita and Dorn, Yuriy and Lobanov, Aleksandr and Kutuzov, Nikolay and Shibaev, Innokentiy and Gorbunov, Eduard and Gasnikov, Alexander and Nazin, Alexander , title =. arXiv preprint arXiv:2402.02461 , year =
-
[14]
Kornilov, Nikita and Shamir, Ohad and Lobanov, Aleksandr and Dvinskikh, Darina and Gasnikov, Alexander and Shibaev, Innokentiy and Gorbunov, Eduard and Horv. Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance , booktitle =
-
[15]
and Pontil, Massimiliano , title =
Mirzaei, Erfan and Maurer, Andreas and Kostic, Vladimir R. and Pontil, Massimiliano , title =. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2025 , publisher =
work page 2025
-
[16]
Journal of Machine Learning Research , volume =
Shamir, Ohad , title =. Journal of Machine Learning Research , volume =
-
[17]
arXiv preprint arXiv:2510.15610 , year =
Chayti, El Mahdi and El Bakkali El Kadi, Taha and Saadi, Omar and Jaggi, Martin , title =. arXiv preprint arXiv:2510.15610 , year =
-
[18]
Improving Stochastic Cubic Newton with Momentum , author=. 2025 , journal=
work page 2025
-
[19]
and Chen, Danqi and Arora, Sanjeev , title =
Malladi, Sadhika and Gao, Tianyu and Nichani, Eshaan and Damian, Alex and Lee, Jason D. and Chen, Danqi and Arora, Sanjeev , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =
-
[20]
Garg, Saurabh and Zhanson, Joshua and Parisotto, Emilio and Prasad, Adarsh and Kolter, J. Zico and Lipton, Zachary C. and Balakrishnan, Sivaraman and Salakhutdinov, Ruslan and Ravikumar, Pradeep , title =. Proceedings of the 38th International Conference on Machine Learning (ICML) , year =
-
[21]
RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization , author=. 2026 , journal=
work page 2026
-
[22]
Nazin, A. V. and Nemirovsky, A. S. and Tsybakov, A. B. and Juditsky, A. B. , title =. Automation and Remote Control , volume =
-
[23]
Sadiev, Abdurakhmon and Danilova, Marina and Gorbunov, Eduard and Horv. High-Probability Bounds for Stochastic Optimization and Variational Inequalities: The Case of Unbounded Variance , booktitle =
-
[24]
Proceedings of the 36th Annual Conference on Learning Theory (COLT) , year =
Liu, Zijian and Zhang, Jiawei and Zhou, Zhengyuan , title =. Proceedings of the 36th Annual Conference on Learning Theory (COLT) , year =
-
[25]
Online convex optimization in the bandit setting: gradient descent without a gradient
Online convex optimization in the bandit setting: gradient descent without a gradient , author=. arXiv preprint cs/0408007 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
International Conference on Learning Representations (ICLR) , year =
Liu, Zijian and Zhou, Zhengyuan , title =. International Conference on Learning Representations (ICLR) , year =
-
[27]
Computational Management Science , volume =
Kornilov, Nikita and Gasnikov, Alexander and Dvurechensky, Pavel and Dvinskikh, Darina , title =. Computational Management Science , volume =
-
[28]
Advances in Neural Information Processing Systems , volume=
High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=
-
[29]
Advances in Neural Information Processing Systems , volume=
Improved convergence in high probability of clipped gradient methods with heavy tailed noise , author=. Advances in Neural Information Processing Systems , volume=
-
[30]
Introduction to derivative-free optimization , author=. 2009 , publisher=
work page 2009
- [31]
-
[32]
Journal of Machine Learning Research , volume=
Random search for hyper-parameter optimization , author=. Journal of Machine Learning Research , volume=
-
[33]
Advances in Neural Information Processing Systems , volume=
Practical bayesian optimization of machine learning algorithms , author=. Advances in Neural Information Processing Systems , volume=
-
[34]
International Conference on Learning Representations , year=
Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. International Conference on Learning Representations , year=
-
[35]
USSR Computational Mathematics and Mathematical Physics , volume=
Some methods of speeding up the convergence of iteration methods , author=. USSR Computational Mathematics and Mathematical Physics , volume=. 1964 , publisher=
work page 1964
-
[36]
Advances in Neural Information Processing Systems , volume=
Momentum-based variance reduction in non-convex SGD , author=. Advances in Neural Information Processing Systems , volume=
-
[37]
Building a question answering test collection , author =. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , year =
-
[38]
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , author =. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , year =
work page 2018
-
[39]
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages =
A Large Annotated Corpus for Learning Natural Language Inference , author =. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages =
work page 2015
-
[40]
Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin , journal =
-
[41]
Making Pre-Trained Language Models Better Few-Shot Learners , author =. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.