Variance-Reduced Gradient Estimator for Nonconvex Zeroth-Order Distributed Optimization
Pith reviewed 2026-05-23 21:02 UTC · model grok-4.3
The pith
A Bernoulli-based variance-reduced estimator achieves O(d/ε) oracle complexity in distributed zeroth-order nonconvex optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a variance-reduced zeroth-order gradient estimator that, with probability governed by a Bernoulli distribution, either renovates a single orthogonal direction or computes a full-dimensional correction, when integrated with gradient tracking, yields oracle complexity O(d/ε) for smooth nonconvex functions and O(dκ ln(1/ε)) for smooth gradient-dominated nonconvex functions.
What carries the argument
The variance-reduced gradient estimator that randomly selects between single-direction renovation and full-dimensional variance correction according to a Bernoulli distribution.
If this is right
- The per-iteration sampling cost is lower than that of 2d-point estimators while still guaranteeing the same convergence rate order.
- Gradient tracking compensates for the partial updates introduced by the single-direction renovation step.
- For gradient-dominated problems the complexity improves from linear to logarithmic dependence on 1/ε.
- Numerical comparisons with existing zeroth-order distributed methods show fewer total function calls on the tested instances.
Where Pith is reading between the lines
- The same Bernoulli choice could be tested in stochastic first-order settings to see whether similar variance reduction appears.
- The estimator might be combined with other communication-efficient mechanisms such as compressed gradients.
- Scaling the method to very large networks would test whether the per-node sampling savings remain dominant over communication overhead.
Load-bearing premise
The proposed estimator, when used with gradient tracking, produces no extra bias or variance that would prevent the stated oracle-complexity bounds from holding.
What would settle it
A numerical run on a smooth nonconvex test function in which the total number of function evaluations required to reach an ε-stationary point grows faster than O(d/ε) as ε decreases.
Figures
read the original abstract
This paper investigates distributed zeroth-order optimization for smooth nonconvex problems, targeting the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation in current algorithms that use either the $2$-point or $2d$-point gradient estimators. We propose a novel variance-reduced gradient estimator that either randomly renovates a single orthogonal direction of the true gradient or calculates the gradient estimation across all dimensions for variance correction, based on a Bernoulli distribution. Integrating this estimator with gradient tracking mechanism allows us to address the trade-off. We show that the oracle complexity of our proposed algorithm is upper bounded by $O(d/\epsilon)$ for smooth nonconvex functions and by $O(d\kappa\ln (1/\epsilon))$ for smooth and gradient dominated nonconvex functions, where $d$ denotes the problem dimension and $\kappa$ is the condition number. Numerical simulations comparing our algorithm with existing methods confirm the effectiveness and efficiency of the proposed gradient estimator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a variance-reduced zeroth-order gradient estimator for distributed nonconvex optimization that randomly selects (via Bernoulli) between single-direction renovation and full-dimensional correction, combined with gradient tracking. It claims this yields oracle complexity O(d/ε) to reach an ε-stationary point for smooth nonconvex objectives and O(dκ ln(1/ε)) for the gradient-dominated case, with supporting numerical simulations.
Significance. If the stated complexity bounds were valid, the work would represent a substantial improvement over existing zeroth-order methods by achieving dimension-linear dependence and sub-quadratic dependence on 1/ε. However, the central claim is inconsistent with known lower bounds, limiting its potential impact.
major comments (2)
- [Abstract] Abstract (and the main nonconvex convergence result): the claimed oracle complexity upper bound of O(d/ε) for smooth nonconvex functions to reach an ε-stationary point contradicts the information-theoretic lower bound of Ω(1/ε²) that holds even for first-order methods (Carmon et al., 2017; Nesterov, 2018). Zeroth-order estimators cannot circumvent this, and the Bernoulli single-direction renovation plus gradient tracking still requires Ω(1) queries per iteration on average; the analysis therefore contains a load-bearing error, most likely in the descent inequality, bias/variance accounting for the estimator, or telescoping of the tracking error.
- [§3] The variance-reduced estimator definition (likely §3): the Bernoulli choice between single-direction renovation and full correction is asserted to produce an unbiased estimator whose variance yields the improved rate, but no explicit bias calculation or variance bound is provided that would allow the O(d/ε) summation; this assumption is load-bearing for both complexity theorems.
minor comments (2)
- Notation for the distributed setting (e.g., local functions f_i, communication graph) is introduced without a dedicated preliminary section, making it difficult to track the precise assumptions on smoothness, gradient dominance, and network connectivity.
- The simulation section compares against baselines but does not report the precise values of d, ε, or number of nodes used to generate the plotted curves, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thorough review and for highlighting the inconsistency between our claimed complexity and known lower bounds. We acknowledge that the O(d/ε) bound for general smooth nonconvex problems cannot hold and likely stems from an error in the convergence analysis. We will revise the manuscript to correct this and strengthen the supporting calculations.
read point-by-point responses
-
Referee: [Abstract] Abstract (and the main nonconvex convergence result): the claimed oracle complexity upper bound of O(d/ε) for smooth nonconvex functions to reach an ε-stationary point contradicts the information-theoretic lower bound of Ω(1/ε²) that holds even for first-order methods (Carmon et al., 2017; Nesterov, 2018). Zeroth-order estimators cannot circumvent this, and the Bernoulli single-direction renovation plus gradient tracking still requires Ω(1) queries per iteration on average; the analysis therefore contains a load-bearing error, most likely in the descent inequality, bias/variance accounting for the estimator, or telescoping of the tracking error.
Authors: We agree that the stated O(d/ε) oracle complexity for general smooth nonconvex objectives is inconsistent with the Ω(1/ε²) lower bound. This indicates an error in our analysis, most probably in the descent inequality or the handling of the estimator's variance and bias when combined with gradient tracking. The Bernoulli mechanism does not reduce the per-iteration query cost below a constant on average, so it cannot improve the fundamental 1/ε² dependence. We will revise the main theorem to remove or correct this claim. The O(dκ ln(1/ε)) bound for the gradient-dominated case may remain valid under the PL-like assumption and will be re-verified separately. revision: yes
-
Referee: [§3] The variance-reduced estimator definition (likely §3): the Bernoulli choice between single-direction renovation and full correction is asserted to produce an unbiased estimator whose variance yields the improved rate, but no explicit bias calculation or variance bound is provided that would allow the O(d/ε) summation; this assumption is load-bearing for both complexity theorems.
Authors: We will add explicit derivations of the bias (which we believe is zero) and a tight variance bound for the proposed estimator in the revised §3. These calculations will be used to re-derive the complexity results under corrected assumptions, ensuring the supporting lemmas are fully rigorous. revision: yes
- The O(d/ε) complexity claim for smooth nonconvex functions cannot be defended, as it directly contradicts established lower bounds.
Circularity Check
No significant circularity detected
full rationale
The paper presents its oracle complexity bounds O(d/ε) and O(dκ ln(1/ε)) as outcomes of a convergence analysis applied to the proposed Bernoulli-based variance-reduced estimator integrated with gradient tracking. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or description. The derivation chain is presented as independent of the target bounds themselves, with numerical simulations offered as external validation rather than internal fitting. This is the most common honest finding for a theoretical analysis paper whose central claims rest on stated assumptions and telescoping arguments rather than tautological redefinitions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Distributed smooth convex optimization with coupled constraints
Shu Liang, Le Yi Wang, and George Yin. Distributed smooth convex optimization with coupled constraints. IEEE Transactions on Automatic Control, 65(1):347–353, 2020
work page 2020
-
[2]
Distributed control for reaching optimal steady state in network systems: An optimization approach
Xuan Zhang, Antonis Papachristodoulou, and Na Li. Distributed control for reaching optimal steady state in network systems: An optimization approach. IEEE Transactions on Automatic Control , 63(3):864–871, 2018
work page 2018
-
[3]
Jie Liu, Daniel W. C. Ho, and Lulu Li. A generic algorithm framework for distributed optimization over the time-varying network with communication delays. IEEE Transactions on Automatic Control, 69(1):371–378, 2024
work page 2024
-
[4]
Distributed subgradient meth- ods for multi-agent optimization
Angelia Nedi ´c and Asuman Ozdaglar. Distributed subgradient meth- ods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009
work page 2009
-
[5]
EXTRA: An exact first-order algorithm for decentralized consensus optimization
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization , 25(2):944–966, 2015
work page 2015
-
[6]
Angelia Nedi ´c. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020
work page 2020
-
[7]
Non- convex distributed feedback optimization for aggregative cooperative robotics
Guido Carnevale, Nicola Mimmo, and Giuseppe Notarstefano. Non- convex distributed feedback optimization for aggregative cooperative robotics. Automatica, 167:111767, 2024
work page 2024
-
[8]
Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems , volume 30, 2017
work page 2017
-
[9]
On nonconvex decentralized gradient descent
Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on signal processing , 66(11):2834–2848, 2018
work page 2018
-
[10]
Distributed nonconvex constrained optimization over time-varying digraphs
Gesualdo Scutari and Ying Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, 176:497–544, 2019
work page 2019
-
[11]
Surrogate-based distributed optimisation for expensive black-box func- tions
Zhongguo Li, Zhen Dong, Zhongchao Liang, and Zhengtao Ding. Surrogate-based distributed optimisation for expensive black-box func- tions. Automatica, 125:109407, 2021
work page 2021
-
[12]
Regret analysis of stochastic and nonstochastic multi-armed bandit problems
S ´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Founda- tions and Trends® in Machine Learning , 5(1):1–122, 2012
work page 2012
-
[13]
Lee, Danqi Chen, and Sanjeev Arora
Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Process- ing Systems, volume 36, pages 53038–53075, 2023
work page 2023
-
[14]
Communication-Efficient Distributed Strongly Convex Stochastic Optimization: Non-Asymptotic Rates
Anit Kumar Sahu, Dusan Jakovetic, Dragana Bajovic, and Soummya Kar. Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. arXiv preprint arXiv:1809.02920, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
Random gradient-free mini- mization of convex functions
Yurii Nesterov and Vladimir Spokoiny. Random gradient-free mini- mization of convex functions. Foundations of Computational Mathe- matics, 17(2):527–566, 2017
work page 2017
-
[16]
Stochastic estimation of the maximum of a regression function
Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462–466, 1952
work page 1952
-
[17]
Distributed zero-order algorithms for nonconvex multiagent optimization
Yujie Tang, Junshan Zhang, and Na Li. Distributed zero-order algorithms for nonconvex multiagent optimization. IEEE Transactions on Control of Network Systems , 8(1):269–281, 2021
work page 2021
-
[18]
Machine learning for variance reduction in online experiments
Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. Machine learning for variance reduction in online experiments. In Advances in Neural Information Processing Systems, volume 34, pages 8637–8648, 2021
work page 2021
-
[19]
Chong Wang, Xi Chen, Alexander J. Smola, and Eric P. Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems , volume 26, 2013
work page 2013
-
[20]
Accelerating stochastic gradient descent using predictive variance reduction
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, 2013
work page 2013
-
[21]
Zeroth-order stochastic variance reduction for nonconvex optimization
Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, volume 31, 2018
work page 2018
-
[22]
Variance-reduced decentralized stochastic optimization with accelerated convergence
Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing , 68:6255–6271, 2020
work page 2020
-
[23]
ZONE: Zeroth-order nonconvex multiagent optimization over networks
Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. ZONE: Zeroth-order nonconvex multiagent optimization over networks. IEEE Transactions on Automatic Control , 64(10):3995–4010, 2019
work page 2019
-
[24]
Prob- lem Complexity and Method Efficiency in Optimization
Arkadij Semenovi ˇc Nemirovskij and David Borisovich Yudin. Prob- lem Complexity and Method Efficiency in Optimization . Wiley- Interscience, 1983
work page 1983
-
[25]
Online convex optimization in the bandit setting: gradient descent without a gradient
Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM- SIAM Symposium on Discrete Algorithms , pages 385–394, 2005
work page 2005
-
[26]
Harnessing smoothness to accelerate distributed optimization
Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017
work page 2017
-
[27]
Achieving geometric convergence for distributed optimization over time-varying graphs
Angelia Nedi ´c, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization , 27(4):2597–2633, 2017
work page 2017
-
[28]
Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012
work page 2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.