High-Probability Last-Iterate Guarantees for Two-Point Gaussian Zeroth-Order Stochastic Gradient Descent
Pith reviewed 2026-06-26 15:51 UTC · model grok-4.3
The pith
Two-point Gaussian zeroth-order SGD attains Õ(d/T) last-iterate accuracy with high probability when dimension exceeds 16 log(6T/δ).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The standard same-sample two-point Gaussian zeroth-order SGD satisfies f(x_T) - f(x*) = Õ(d/T) with probability at least 1-δ whenever d ≥ 16 log(6T/δ), under unbiased stochastic gradients and the conditional exponential-moment bound on the squared norm of the noise; the analysis is direct and does not invoke Markov's inequality or truncate the noise.
What carries the argument
The uniform weighted scan for Gaussian angles combined with the angle-enlarged product-martingale boundary that controls the signed suffix-product term in the unrolled stochastic recursion.
If this is right
- The dependence on the failure probability δ is only logarithmic.
- The guarantee applies to the unmodified same-sample two-point Gaussian recursion.
- The result holds for smooth strongly convex stochastic objectives.
- No truncation of the noise or conversion from expectation bounds is required.
Where Pith is reading between the lines
- The martingale technique may extend to other zeroth-order recursions that produce similar product terms.
- In regimes where the exponential-moment condition is plausible, the bound supplies tighter tail control than Markov-based alternatives.
- The explicit dimension threshold suggests that low-dimensional instances may require separate handling or different analysis.
Load-bearing premise
The conditional exponential-moment bound on the squared norm of the stochastic-gradient noise.
What would settle it
A direct calculation or simulation in which the product-martingale term exceeds the claimed bound when the noise satisfies only polynomial moments instead of the exponential-moment condition.
read the original abstract
We establish a direct high-probability last-iterate guarantee for the standard same-sample two-point Gaussian zeroth-order stochastic-gradient method applied to smooth, strongly convex stochastic optimization. At each iteration, the method draws a fresh Gaussian direction, evaluates the objective at two symmetric perturbations using the same stochastic sample, and takes a norm-normalized stochastic-approximation step. Assuming unbiased stochastic gradients and a conditional exponential-moment bound on the squared norm of the stochastic-gradient noise, we prove that, whenever \(d\ge16\log(6T/\delta)\), \[ f(\bx_T)-f(\bx^*) = \widetilde{\mathcal O}\!\left(\frac{d}{T}\right) \] with probability at least \(1-\delta\), up to fixed problem parameters and logarithmic factors. The confidence dependence is therefore logarithmic rather than polynomial in \(1/\delta\). The analysis is direct: it neither invokes Markov's inequality to convert an expectation bound nor truncates the noise. We are not aware of a prior direct high-probability last-iterate result at this zeroth-order scale for the same-sample Gaussian recursion under conditional sub-Gaussian stochastic-gradient noise. The proof combines a uniform weighted scan for Gaussian angles with an angle-enlarged product-martingale boundary that controls the signed suffix-product term arising from the unrolled stochastic recursion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves a direct high-probability last-iterate bound for the same-sample two-point Gaussian zeroth-order SGD on smooth strongly convex stochastic optimization. Under unbiased stochastic gradients and a conditional exponential-moment bound on the squared norm of the noise, it shows that whenever d ≥ 16 log(6T/δ), f(x_T) − f(x*) = Õ(d/T) holds with probability at least 1−δ. The analysis relies on a uniform weighted scan over Gaussian angles combined with an angle-enlarged product-martingale boundary to control the signed suffix-product term in the unrolled recursion, avoiding both Markov conversion and truncation.
Significance. If the central derivation holds, the result supplies the first direct high-probability last-iterate guarantee at the natural zeroth-order scale for this recursion under conditional sub-Gaussian noise, with only logarithmic dependence on 1/δ. The technical devices (weighted Gaussian-angle scan and enlarged product-martingale boundary) are of independent interest for high-probability analyses of stochastic recursions.
major comments (1)
- [§3, Theorem 3.1] §3 (Theorem 3.1 and the product-martingale boundary construction): the angle-enlarged boundary is stated to control the signed suffix-product term without truncation, but the precise enlargement factor and the invocation of the conditional exponential-moment assumption to close the supermartingale inequality are not cross-referenced to an explicit lemma; a self-contained verification that the boundary remains valid under the stated noise condition would strengthen the claim.
minor comments (2)
- [Theorem 3.1] The dimension threshold d ≥ 16 log(6T/δ) appears in the statement of the main theorem; a short remark explaining how the constant 16 arises from the Gaussian concentration tail used in the angle scan would improve readability.
- [§2] Notation for the two-point estimator (same-sample vs. independent-sample) is introduced in the abstract and §1 but could be restated once in the algorithm box for clarity.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the constructive suggestion regarding the clarity of the product-martingale boundary argument. We address the single major comment below.
read point-by-point responses
-
Referee: [§3, Theorem 3.1] §3 (Theorem 3.1 and the product-martingale boundary construction): the angle-enlarged boundary is stated to control the signed suffix-product term without truncation, but the precise enlargement factor and the invocation of the conditional exponential-moment assumption to close the supermartingale inequality are not cross-referenced to an explicit lemma; a self-contained verification that the boundary remains valid under the stated noise condition would strengthen the claim.
Authors: We agree that an explicit cross-reference and self-contained verification would improve readability. In the revision we will insert a dedicated lemma (placed immediately before Theorem 3.1) that states the precise enlargement factor, recalls the conditional exponential-moment assumption, and verifies that the angle-enlarged process remains a supermartingale. The proof of the lemma will be self-contained and will not alter the subsequent argument or the stated rates. revision: yes
Circularity Check
No significant circularity
full rationale
The paper presents a direct mathematical proof of a high-probability last-iterate convergence bound for two-point Gaussian zeroth-order SGD, starting from explicit assumptions (unbiased stochastic gradients and conditional exponential-moment bound on noise) and deriving the stated Õ(d/T) rate under the dimension condition d ≥ 16 log(6T/δ). No steps reduce by construction to fitted parameters, renamed empirical patterns, or load-bearing self-citations; the argument relies on martingale concentration and Gaussian angle scanning without invoking prior results by the same author as an unverified uniqueness theorem or ansatz. The derivation is self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption The objective function is smooth and strongly convex.
- domain assumption Stochastic gradients are unbiased.
- domain assumption Conditional exponential-moment bound on the squared norm of the stochastic-gradient noise.
Reference graph
Works this paper leans on
-
[1]
Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback . In Proceedings of the 23rd Annual Conference on Learning Theory (pp.\ 28--40)
2010
-
[2]
E., & Nocedal, J
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning . SIAM Review , 60(2), 223--311
2018
-
[3]
R., Scheinberg, K., & Vicente, L
Conn, A. R., Scheinberg, K., & Vicente, L. N. (2009). Introduction to Derivative-Free Optimization . Philadelphia, PA: SIAM
2009
-
[4]
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs . arXiv preprint arXiv:2305.14314
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
C., Jordan, M
Duchi, J. C., Jordan, M. I., Wainwright, M. J., & Wibisono, A. (2015). Optimal Rates for Zero-Order Convex Optimization: The Power of Two Function Evaluations . IEEE Transactions on Information Theory , 61(5), 2788--2806
2015
-
[6]
D., Kalai, A
Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient . In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp.\ 385--394)
2005
-
[7]
& Lan, G
Ghadimi, S. & Lan, G. (2013). Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming . SIAM Journal on Optimization , 23(4), 2341--2368
2013
- [8]
-
[9]
J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models . In International Conference on Learning Representations
2022
-
[10]
Kornilov, N., Shamir, O., Lobanov, A., Dvinskikh, D., Gasnikov, A., Shibaev, I., Gorbunov, E., & Horv \'a th, S. (2023). Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance . In Advances in Neural Information Processing Systems , volume 36
2023
-
[11]
& Massart, P
Laurent, B. & Massart, P. (2000). Adaptive Estimation of a Quadratic Functional by Model Selection . Annals of Statistics , 28(5), 1302--1338
2000
- [12]
-
[13]
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-Tuning Language Models with Just Forward Passes . arXiv preprint arXiv:2305.17333 . Accepted by NeurIPS 2023 (oral)
-
[14]
Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. (2009). Robust Stochastic Approximation Approach to Stochastic Programming . SIAM Journal on Optimization , 19(4), 1574--1609
2009
-
[15]
& Spokoiny, V
Nesterov, Y. & Spokoiny, V. (2017). Random Gradient-Free Minimization of Convex Functions . Foundations of Computational Mathematics , 17(2), 527--566
2017
-
[16]
Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization . In Proceedings of the 29th International Conference on Machine Learning
2012
-
[17]
& Monro, S
Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method . The Annals of Mathematical Statistics , 22(3), 400--407
1951
-
[18]
Shamir, O. (2017). An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback . Journal of Machine Learning Research , 18(52), 1--11
2017
-
[19]
Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The Advancement in Stochastic Zeroth-Order Optimization: Mechanism of Accelerated Convergence of Gaussian Direction on Objectives with Skewed Hessian Eigenvalues . Unpublished manuscript
2024
-
[20]
Ye, H. (2026a). High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions . arXiv preprint arXiv:2605.26547
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Ye, H. (2026b). Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback . arXiv preprint arXiv:2603.25029
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.