High-Probability Last-Iterate Guarantees for Two-Point Gaussian Zeroth-Order Stochastic Gradient Descent

Haishan Ye

arxiv: 2606.20446 · v1 · pith:CPPJ6QUSnew · submitted 2026-06-18 · 🧮 math.OC

High-Probability Last-Iterate Guarantees for Two-Point Gaussian Zeroth-Order Stochastic Gradient Descent

Haishan Ye This is my paper

Pith reviewed 2026-06-26 15:51 UTC · model grok-4.3

classification 🧮 math.OC

keywords zeroth-order optimizationstochastic gradient descenthigh-probability boundslast-iterate convergenceGaussian perturbationssmooth strongly convex optimization

0 comments

The pith

Two-point Gaussian zeroth-order SGD attains Õ(d/T) last-iterate accuracy with high probability when dimension exceeds 16 log(6T/δ).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a direct high-probability guarantee on the final iterate of the standard same-sample two-point Gaussian zeroth-order stochastic gradient method for smooth strongly convex problems. Under unbiased stochastic gradients and a conditional exponential-moment bound on noise, it shows the function gap is Õ(d/T) with probability 1-δ whenever d is at least 16 log(6T/δ). The proof works directly on the recursion via a uniform weighted scan of Gaussian angles and an angle-enlarged product-martingale bound, avoiding Markov conversion or truncation. A reader would care because the failure probability enters only logarithmically rather than polynomially.

Core claim

The standard same-sample two-point Gaussian zeroth-order SGD satisfies f(x_T) - f(x*) = Õ(d/T) with probability at least 1-δ whenever d ≥ 16 log(6T/δ), under unbiased stochastic gradients and the conditional exponential-moment bound on the squared norm of the noise; the analysis is direct and does not invoke Markov's inequality or truncate the noise.

What carries the argument

The uniform weighted scan for Gaussian angles combined with the angle-enlarged product-martingale boundary that controls the signed suffix-product term in the unrolled stochastic recursion.

If this is right

The dependence on the failure probability δ is only logarithmic.
The guarantee applies to the unmodified same-sample two-point Gaussian recursion.
The result holds for smooth strongly convex stochastic objectives.
No truncation of the noise or conversion from expectation bounds is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The martingale technique may extend to other zeroth-order recursions that produce similar product terms.
In regimes where the exponential-moment condition is plausible, the bound supplies tighter tail control than Markov-based alternatives.
The explicit dimension threshold suggests that low-dimensional instances may require separate handling or different analysis.

Load-bearing premise

The conditional exponential-moment bound on the squared norm of the stochastic-gradient noise.

What would settle it

A direct calculation or simulation in which the product-martingale term exceeds the claimed bound when the noise satisfies only polynomial moments instead of the exponential-moment condition.

read the original abstract

We establish a direct high-probability last-iterate guarantee for the standard same-sample two-point Gaussian zeroth-order stochastic-gradient method applied to smooth, strongly convex stochastic optimization. At each iteration, the method draws a fresh Gaussian direction, evaluates the objective at two symmetric perturbations using the same stochastic sample, and takes a norm-normalized stochastic-approximation step. Assuming unbiased stochastic gradients and a conditional exponential-moment bound on the squared norm of the stochastic-gradient noise, we prove that, whenever \(d\ge16\log(6T/\delta)\), \[ f(\bx_T)-f(\bx^*) = \widetilde{\mathcal O}\!\left(\frac{d}{T}\right) \] with probability at least \(1-\delta\), up to fixed problem parameters and logarithmic factors. The confidence dependence is therefore logarithmic rather than polynomial in \(1/\delta\). The analysis is direct: it neither invokes Markov's inequality to convert an expectation bound nor truncates the noise. We are not aware of a prior direct high-probability last-iterate result at this zeroth-order scale for the same-sample Gaussian recursion under conditional sub-Gaussian stochastic-gradient noise. The proof combines a uniform weighted scan for Gaussian angles with an angle-enlarged product-martingale boundary that controls the signed suffix-product term arising from the unrolled stochastic recursion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Direct high-prob last-iterate bound for two-point Gaussian ZO-SGD with log δ dependence via weighted angle scan and enlarged product-martingale.

read the letter

The main takeaway is a direct high-probability last-iterate guarantee for the standard same-sample two-point Gaussian zeroth-order SGD on smooth strongly convex stochastic problems. Under unbiased gradients and a conditional exponential-moment bound on the squared noise norm, it gets f(x_T) - f* = Õ(d/T) with probability 1-δ whenever d ≥ 16 log(6T/δ). The logarithmic dependence on 1/δ is the concrete improvement over polynomial alternatives.

The analysis is direct: it skips Markov conversion and truncation by combining a uniform weighted scan over Gaussian angles with an angle-enlarged product-martingale boundary to control the signed suffix-product term from the unrolled recursion. The stress-test found no internal inconsistency or hidden step in that argument, and the dimension threshold follows from Gaussian concentration for the scan. That combination appears to be the technical step that lets them claim novelty for this exact recursion.

The noise-moment assumption does the heavy lifting to bound the martingale without extra devices. It is stronger than a plain variance bound but fits the setting they target. The paper does not explore whether the dimension threshold is necessary or how the constants behave in practice, but those are secondary.

This is for researchers working on high-probability bounds or zeroth-order methods in stochastic optimization. A reader focused on last-iterate convergence or black-box algorithms would get value from the technique. The claim is specific, the approach is grounded in the stated assumptions, and it addresses a gap they identify, so it deserves a serious referee.

Referee Report

1 major / 2 minor

Summary. The paper proves a direct high-probability last-iterate bound for the same-sample two-point Gaussian zeroth-order SGD on smooth strongly convex stochastic optimization. Under unbiased stochastic gradients and a conditional exponential-moment bound on the squared norm of the noise, it shows that whenever d ≥ 16 log(6T/δ), f(x_T) − f(x*) = Õ(d/T) holds with probability at least 1−δ. The analysis relies on a uniform weighted scan over Gaussian angles combined with an angle-enlarged product-martingale boundary to control the signed suffix-product term in the unrolled recursion, avoiding both Markov conversion and truncation.

Significance. If the central derivation holds, the result supplies the first direct high-probability last-iterate guarantee at the natural zeroth-order scale for this recursion under conditional sub-Gaussian noise, with only logarithmic dependence on 1/δ. The technical devices (weighted Gaussian-angle scan and enlarged product-martingale boundary) are of independent interest for high-probability analyses of stochastic recursions.

major comments (1)

[§3, Theorem 3.1] §3 (Theorem 3.1 and the product-martingale boundary construction): the angle-enlarged boundary is stated to control the signed suffix-product term without truncation, but the precise enlargement factor and the invocation of the conditional exponential-moment assumption to close the supermartingale inequality are not cross-referenced to an explicit lemma; a self-contained verification that the boundary remains valid under the stated noise condition would strengthen the claim.

minor comments (2)

[Theorem 3.1] The dimension threshold d ≥ 16 log(6T/δ) appears in the statement of the main theorem; a short remark explaining how the constant 16 arises from the Gaussian concentration tail used in the angle scan would improve readability.
[§2] Notation for the two-point estimator (same-sample vs. independent-sample) is introduced in the abstract and §1 but could be restated once in the algorithm box for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive suggestion regarding the clarity of the product-martingale boundary argument. We address the single major comment below.

read point-by-point responses

Referee: [§3, Theorem 3.1] §3 (Theorem 3.1 and the product-martingale boundary construction): the angle-enlarged boundary is stated to control the signed suffix-product term without truncation, but the precise enlargement factor and the invocation of the conditional exponential-moment assumption to close the supermartingale inequality are not cross-referenced to an explicit lemma; a self-contained verification that the boundary remains valid under the stated noise condition would strengthen the claim.

Authors: We agree that an explicit cross-reference and self-contained verification would improve readability. In the revision we will insert a dedicated lemma (placed immediately before Theorem 3.1) that states the precise enlargement factor, recalls the conditional exponential-moment assumption, and verifies that the angle-enlarged process remains a supermartingale. The proof of the lemma will be self-contained and will not alter the subsequent argument or the stated rates. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a direct mathematical proof of a high-probability last-iterate convergence bound for two-point Gaussian zeroth-order SGD, starting from explicit assumptions (unbiased stochastic gradients and conditional exponential-moment bound on noise) and deriving the stated Õ(d/T) rate under the dimension condition d ≥ 16 log(6T/δ). No steps reduce by construction to fitted parameters, renamed empirical patterns, or load-bearing self-citations; the argument relies on martingale concentration and Gaussian angle scanning without invoking prior results by the same author as an unverified uniqueness theorem or ansatz. The derivation is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The claim rests on three standard domain assumptions plus one specialized noise moment condition; no free parameters or invented entities are introduced.

axioms (3)

domain assumption The objective function is smooth and strongly convex.
Invoked to obtain the linear convergence rate in the deterministic case and to control the bias term.
domain assumption Stochastic gradients are unbiased.
Standard assumption allowing the expectation of the stochastic approximation step to align with the true gradient.
domain assumption Conditional exponential-moment bound on the squared norm of the stochastic-gradient noise.
Enables direct control of the product-martingale term without truncation or conversion from expectation.

pith-pipeline@v0.9.1-grok · 5766 in / 1422 out tokens · 30338 ms · 2026-06-26T15:51:31.371137+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback . In Proceedings of the 23rd Annual Conference on Learning Theory (pp.\ 28--40)

2010
[2]

E., & Nocedal, J

Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning . SIAM Review , 60(2), 223--311

2018
[3]

R., Scheinberg, K., & Vicente, L

Conn, A. R., Scheinberg, K., & Vicente, L. N. (2009). Introduction to Derivative-Free Optimization . Philadelphia, PA: SIAM

2009
[4]

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs . arXiv preprint arXiv:2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

C., Jordan, M

Duchi, J. C., Jordan, M. I., Wainwright, M. J., & Wibisono, A. (2015). Optimal Rates for Zero-Order Convex Optimization: The Power of Two Function Evaluations . IEEE Transactions on Information Theory , 61(5), 2788--2806

2015
[6]

D., Kalai, A

Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient . In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp.\ 385--394)

2005
[7]

& Lan, G

Ghadimi, S. & Lan, G. (2013). Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming . SIAM Journal on Optimization , 23(4), 2341--2368

2013
[8]

Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., & Gasnikov, A. (2021). High-Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise . arXiv preprint arXiv:2106.05958

work page arXiv 2021
[9]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models . In International Conference on Learning Representations

2022
[10]

Kornilov, N., Shamir, O., Lobanov, A., Dvinskikh, D., Gasnikov, A., Shibaev, I., Gorbunov, E., & Horv \'a th, S. (2023). Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance . In Advances in Neural Information Processing Systems , volume 36

2023
[11]

& Massart, P

Laurent, B. & Massart, P. (2000). Adaptive Estimation of a Quadratic Functional by Model Selection . Annals of Statistics , 28(5), 1302--1338

2000
[12]

& Zhou, Z

Liu, Z. & Zhou, Z. (2023). Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods . arXiv preprint arXiv:2312.08531

work page arXiv 2023
[13]

D., Chen, D., & Arora, S

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-Tuning Language Models with Just Forward Passes . arXiv preprint arXiv:2305.17333 . Accepted by NeurIPS 2023 (oral)

work page arXiv 2023
[14]

Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. (2009). Robust Stochastic Approximation Approach to Stochastic Programming . SIAM Journal on Optimization , 19(4), 1574--1609

2009
[15]

& Spokoiny, V

Nesterov, Y. & Spokoiny, V. (2017). Random Gradient-Free Minimization of Convex Functions . Foundations of Computational Mathematics , 17(2), 527--566

2017
[16]

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization . In Proceedings of the 29th International Conference on Machine Learning

2012
[17]

& Monro, S

Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method . The Annals of Mathematical Statistics , 22(3), 400--407

1951
[18]

Shamir, O. (2017). An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback . Journal of Machine Learning Research , 18(52), 1--11

2017
[19]

Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The Advancement in Stochastic Zeroth-Order Optimization: Mechanism of Accelerated Convergence of Gaussian Direction on Objectives with Skewed Hessian Eigenvalues . Unpublished manuscript

2024
[20]

Ye, H. (2026a). High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions . arXiv preprint arXiv:2605.26547

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Ye, H. (2026b). Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback . arXiv preprint arXiv:2603.25029

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback . In Proceedings of the 23rd Annual Conference on Learning Theory (pp.\ 28--40)

2010

[2] [2]

E., & Nocedal, J

Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning . SIAM Review , 60(2), 223--311

2018

[3] [3]

R., Scheinberg, K., & Vicente, L

Conn, A. R., Scheinberg, K., & Vicente, L. N. (2009). Introduction to Derivative-Free Optimization . Philadelphia, PA: SIAM

2009

[4] [4]

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs . arXiv preprint arXiv:2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

C., Jordan, M

Duchi, J. C., Jordan, M. I., Wainwright, M. J., & Wibisono, A. (2015). Optimal Rates for Zero-Order Convex Optimization: The Power of Two Function Evaluations . IEEE Transactions on Information Theory , 61(5), 2788--2806

2015

[6] [6]

D., Kalai, A

Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient . In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp.\ 385--394)

2005

[7] [7]

& Lan, G

Ghadimi, S. & Lan, G. (2013). Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming . SIAM Journal on Optimization , 23(4), 2341--2368

2013

[8] [8]

Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., & Gasnikov, A. (2021). High-Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise . arXiv preprint arXiv:2106.05958

work page arXiv 2021

[9] [9]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models . In International Conference on Learning Representations

2022

[10] [10]

Kornilov, N., Shamir, O., Lobanov, A., Dvinskikh, D., Gasnikov, A., Shibaev, I., Gorbunov, E., & Horv \'a th, S. (2023). Accelerated Zeroth-Order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance . In Advances in Neural Information Processing Systems , volume 36

2023

[11] [11]

& Massart, P

Laurent, B. & Massart, P. (2000). Adaptive Estimation of a Quadratic Functional by Model Selection . Annals of Statistics , 28(5), 1302--1338

2000

[12] [12]

& Zhou, Z

Liu, Z. & Zhou, Z. (2023). Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods . arXiv preprint arXiv:2312.08531

work page arXiv 2023

[13] [13]

D., Chen, D., & Arora, S

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-Tuning Language Models with Just Forward Passes . arXiv preprint arXiv:2305.17333 . Accepted by NeurIPS 2023 (oral)

work page arXiv 2023

[14] [14]

Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. (2009). Robust Stochastic Approximation Approach to Stochastic Programming . SIAM Journal on Optimization , 19(4), 1574--1609

2009

[15] [15]

& Spokoiny, V

Nesterov, Y. & Spokoiny, V. (2017). Random Gradient-Free Minimization of Convex Functions . Foundations of Computational Mathematics , 17(2), 527--566

2017

[16] [16]

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization . In Proceedings of the 29th International Conference on Machine Learning

2012

[17] [17]

& Monro, S

Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method . The Annals of Mathematical Statistics , 22(3), 400--407

1951

[18] [18]

Shamir, O. (2017). An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback . Journal of Machine Learning Research , 18(52), 1--11

2017

[19] [19]

Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The Advancement in Stochastic Zeroth-Order Optimization: Mechanism of Accelerated Convergence of Gaussian Direction on Objectives with Skewed Hessian Eigenvalues . Unpublished manuscript

2024

[20] [20]

Ye, H. (2026a). High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions . arXiv preprint arXiv:2605.26547

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Ye, H. (2026b). Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback . arXiv preprint arXiv:2603.25029

work page internal anchor Pith review Pith/arXiv arXiv