pith. sign in

arxiv: 2409.19567 · v2 · submitted 2024-09-29 · 🧮 math.OC · cs.MA· cs.SY· eess.SY

Variance-Reduced Gradient Estimator for Nonconvex Zeroth-Order Distributed Optimization

Pith reviewed 2026-05-23 21:02 UTC · model grok-4.3

classification 🧮 math.OC cs.MAcs.SYeess.SY
keywords zeroth-order optimizationdistributed optimizationvariance reductionnonconvex optimizationgradient trackingoracle complexity
0
0 comments X

The pith

A Bernoulli-based variance-reduced estimator achieves O(d/ε) oracle complexity in distributed zeroth-order nonconvex optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a gradient estimator for distributed zeroth-order optimization that randomly chooses, via a Bernoulli draw, between renovating one orthogonal direction of the gradient or performing a full-dimensional correction. This choice reduces the number of function evaluations per step compared with always using 2d-point estimators while controlling variance. When the estimator is combined with a gradient tracking mechanism, the resulting algorithm attains an upper bound of O(d/ε) oracle calls to reach an ε-stationary point for smooth nonconvex objectives and O(dκ ln(1/ε)) calls when the objective is also gradient-dominated. The improvement matters because zeroth-order methods are used when explicit gradients are unavailable and each function evaluation is expensive in a distributed network.

Core claim

The paper claims that a variance-reduced zeroth-order gradient estimator that, with probability governed by a Bernoulli distribution, either renovates a single orthogonal direction or computes a full-dimensional correction, when integrated with gradient tracking, yields oracle complexity O(d/ε) for smooth nonconvex functions and O(dκ ln(1/ε)) for smooth gradient-dominated nonconvex functions.

What carries the argument

The variance-reduced gradient estimator that randomly selects between single-direction renovation and full-dimensional variance correction according to a Bernoulli distribution.

If this is right

  • The per-iteration sampling cost is lower than that of 2d-point estimators while still guaranteeing the same convergence rate order.
  • Gradient tracking compensates for the partial updates introduced by the single-direction renovation step.
  • For gradient-dominated problems the complexity improves from linear to logarithmic dependence on 1/ε.
  • Numerical comparisons with existing zeroth-order distributed methods show fewer total function calls on the tested instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Bernoulli choice could be tested in stochastic first-order settings to see whether similar variance reduction appears.
  • The estimator might be combined with other communication-efficient mechanisms such as compressed gradients.
  • Scaling the method to very large networks would test whether the per-node sampling savings remain dominant over communication overhead.

Load-bearing premise

The proposed estimator, when used with gradient tracking, produces no extra bias or variance that would prevent the stated oracle-complexity bounds from holding.

What would settle it

A numerical run on a smooth nonconvex test function in which the total number of function evaluations required to reach an ε-stationary point grows faster than O(d/ε) as ε decreases.

Figures

Figures reproduced from arXiv: 2409.19567 by Huaiyi Mu, Jie Song, Yujie Tang, Zhongkui Li.

Figure 1
Figure 1. Figure 1: Convergence of Algorithm 1, ZONE-M with J =100, DGD-2p, [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Convergence of Algorithm 1 with different dimension [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

This paper investigates distributed zeroth-order optimization for smooth nonconvex problems, targeting the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation in current algorithms that use either the $2$-point or $2d$-point gradient estimators. We propose a novel variance-reduced gradient estimator that either randomly renovates a single orthogonal direction of the true gradient or calculates the gradient estimation across all dimensions for variance correction, based on a Bernoulli distribution. Integrating this estimator with gradient tracking mechanism allows us to address the trade-off. We show that the oracle complexity of our proposed algorithm is upper bounded by $O(d/\epsilon)$ for smooth nonconvex functions and by $O(d\kappa\ln (1/\epsilon))$ for smooth and gradient dominated nonconvex functions, where $d$ denotes the problem dimension and $\kappa$ is the condition number. Numerical simulations comparing our algorithm with existing methods confirm the effectiveness and efficiency of the proposed gradient estimator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a variance-reduced zeroth-order gradient estimator for distributed nonconvex optimization that randomly selects (via Bernoulli) between single-direction renovation and full-dimensional correction, combined with gradient tracking. It claims this yields oracle complexity O(d/ε) to reach an ε-stationary point for smooth nonconvex objectives and O(dκ ln(1/ε)) for the gradient-dominated case, with supporting numerical simulations.

Significance. If the stated complexity bounds were valid, the work would represent a substantial improvement over existing zeroth-order methods by achieving dimension-linear dependence and sub-quadratic dependence on 1/ε. However, the central claim is inconsistent with known lower bounds, limiting its potential impact.

major comments (2)
  1. [Abstract] Abstract (and the main nonconvex convergence result): the claimed oracle complexity upper bound of O(d/ε) for smooth nonconvex functions to reach an ε-stationary point contradicts the information-theoretic lower bound of Ω(1/ε²) that holds even for first-order methods (Carmon et al., 2017; Nesterov, 2018). Zeroth-order estimators cannot circumvent this, and the Bernoulli single-direction renovation plus gradient tracking still requires Ω(1) queries per iteration on average; the analysis therefore contains a load-bearing error, most likely in the descent inequality, bias/variance accounting for the estimator, or telescoping of the tracking error.
  2. [§3] The variance-reduced estimator definition (likely §3): the Bernoulli choice between single-direction renovation and full correction is asserted to produce an unbiased estimator whose variance yields the improved rate, but no explicit bias calculation or variance bound is provided that would allow the O(d/ε) summation; this assumption is load-bearing for both complexity theorems.
minor comments (2)
  1. Notation for the distributed setting (e.g., local functions f_i, communication graph) is introduced without a dedicated preliminary section, making it difficult to track the precise assumptions on smoothness, gradient dominance, and network connectivity.
  2. The simulation section compares against baselines but does not report the precise values of d, ε, or number of nodes used to generate the plotted curves, hindering reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thorough review and for highlighting the inconsistency between our claimed complexity and known lower bounds. We acknowledge that the O(d/ε) bound for general smooth nonconvex problems cannot hold and likely stems from an error in the convergence analysis. We will revise the manuscript to correct this and strengthen the supporting calculations.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and the main nonconvex convergence result): the claimed oracle complexity upper bound of O(d/ε) for smooth nonconvex functions to reach an ε-stationary point contradicts the information-theoretic lower bound of Ω(1/ε²) that holds even for first-order methods (Carmon et al., 2017; Nesterov, 2018). Zeroth-order estimators cannot circumvent this, and the Bernoulli single-direction renovation plus gradient tracking still requires Ω(1) queries per iteration on average; the analysis therefore contains a load-bearing error, most likely in the descent inequality, bias/variance accounting for the estimator, or telescoping of the tracking error.

    Authors: We agree that the stated O(d/ε) oracle complexity for general smooth nonconvex objectives is inconsistent with the Ω(1/ε²) lower bound. This indicates an error in our analysis, most probably in the descent inequality or the handling of the estimator's variance and bias when combined with gradient tracking. The Bernoulli mechanism does not reduce the per-iteration query cost below a constant on average, so it cannot improve the fundamental 1/ε² dependence. We will revise the main theorem to remove or correct this claim. The O(dκ ln(1/ε)) bound for the gradient-dominated case may remain valid under the PL-like assumption and will be re-verified separately. revision: yes

  2. Referee: [§3] The variance-reduced estimator definition (likely §3): the Bernoulli choice between single-direction renovation and full correction is asserted to produce an unbiased estimator whose variance yields the improved rate, but no explicit bias calculation or variance bound is provided that would allow the O(d/ε) summation; this assumption is load-bearing for both complexity theorems.

    Authors: We will add explicit derivations of the bias (which we believe is zero) and a tight variance bound for the proposed estimator in the revised §3. These calculations will be used to re-derive the complexity results under corrected assumptions, ensuring the supporting lemmas are fully rigorous. revision: yes

standing simulated objections not resolved
  • The O(d/ε) complexity claim for smooth nonconvex functions cannot be defended, as it directly contradicts established lower bounds.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents its oracle complexity bounds O(d/ε) and O(dκ ln(1/ε)) as outcomes of a convergence analysis applied to the proposed Bernoulli-based variance-reduced estimator integrated with gradient tracking. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or description. The derivation chain is presented as independent of the target bounds themselves, with numerical simulations offered as external validation rather than internal fitting. This is the most common honest finding for a theoretical analysis paper whose central claims rest on stated assumptions and telescoping arguments rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the Bernoulli mechanism is part of the algorithmic design rather than a fitted quantity.

pith-pipeline@v0.9.0 · 5708 in / 1237 out tokens · 47455 ms · 2026-05-23T21:02:04.306112+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    Distributed smooth convex optimization with coupled constraints

    Shu Liang, Le Yi Wang, and George Yin. Distributed smooth convex optimization with coupled constraints. IEEE Transactions on Automatic Control, 65(1):347–353, 2020

  2. [2]

    Distributed control for reaching optimal steady state in network systems: An optimization approach

    Xuan Zhang, Antonis Papachristodoulou, and Na Li. Distributed control for reaching optimal steady state in network systems: An optimization approach. IEEE Transactions on Automatic Control , 63(3):864–871, 2018

  3. [3]

    Jie Liu, Daniel W. C. Ho, and Lulu Li. A generic algorithm framework for distributed optimization over the time-varying network with communication delays. IEEE Transactions on Automatic Control, 69(1):371–378, 2024

  4. [4]

    Distributed subgradient meth- ods for multi-agent optimization

    Angelia Nedi ´c and Asuman Ozdaglar. Distributed subgradient meth- ods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009

  5. [5]

    EXTRA: An exact first-order algorithm for decentralized consensus optimization

    Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization , 25(2):944–966, 2015

  6. [6]

    Distributed gradient methods for convex machine learning problems in networks: Distributed optimization

    Angelia Nedi ´c. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020

  7. [7]

    Non- convex distributed feedback optimization for aggregative cooperative robotics

    Guido Carnevale, Nicola Mimmo, and Giuseppe Notarstefano. Non- convex distributed feedback optimization for aggregative cooperative robotics. Automatica, 167:111767, 2024

  8. [8]

    Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent

    Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems , volume 30, 2017

  9. [9]

    On nonconvex decentralized gradient descent

    Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on signal processing , 66(11):2834–2848, 2018

  10. [10]

    Distributed nonconvex constrained optimization over time-varying digraphs

    Gesualdo Scutari and Ying Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, 176:497–544, 2019

  11. [11]

    Surrogate-based distributed optimisation for expensive black-box func- tions

    Zhongguo Li, Zhen Dong, Zhongchao Liang, and Zhengtao Ding. Surrogate-based distributed optimisation for expensive black-box func- tions. Automatica, 125:109407, 2021

  12. [12]

    Regret analysis of stochastic and nonstochastic multi-armed bandit problems

    S ´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Founda- tions and Trends® in Machine Learning , 5(1):1–122, 2012

  13. [13]

    Lee, Danqi Chen, and Sanjeev Arora

    Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Process- ing Systems, volume 36, pages 53038–53075, 2023

  14. [14]

    Communication-Efficient Distributed Strongly Convex Stochastic Optimization: Non-Asymptotic Rates

    Anit Kumar Sahu, Dusan Jakovetic, Dragana Bajovic, and Soummya Kar. Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. arXiv preprint arXiv:1809.02920, 2018

  15. [15]

    Random gradient-free mini- mization of convex functions

    Yurii Nesterov and Vladimir Spokoiny. Random gradient-free mini- mization of convex functions. Foundations of Computational Mathe- matics, 17(2):527–566, 2017

  16. [16]

    Stochastic estimation of the maximum of a regression function

    Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462–466, 1952

  17. [17]

    Distributed zero-order algorithms for nonconvex multiagent optimization

    Yujie Tang, Junshan Zhang, and Na Li. Distributed zero-order algorithms for nonconvex multiagent optimization. IEEE Transactions on Control of Network Systems , 8(1):269–281, 2021

  18. [18]

    Machine learning for variance reduction in online experiments

    Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. Machine learning for variance reduction in online experiments. In Advances in Neural Information Processing Systems, volume 34, pages 8637–8648, 2021

  19. [19]

    Smola, and Eric P

    Chong Wang, Xi Chen, Alexander J. Smola, and Eric P. Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems , volume 26, 2013

  20. [20]

    Accelerating stochastic gradient descent using predictive variance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, 2013

  21. [21]

    Zeroth-order stochastic variance reduction for nonconvex optimization

    Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, volume 31, 2018

  22. [22]

    Variance-reduced decentralized stochastic optimization with accelerated convergence

    Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing , 68:6255–6271, 2020

  23. [23]

    ZONE: Zeroth-order nonconvex multiagent optimization over networks

    Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. ZONE: Zeroth-order nonconvex multiagent optimization over networks. IEEE Transactions on Automatic Control , 64(10):3995–4010, 2019

  24. [24]

    Prob- lem Complexity and Method Efficiency in Optimization

    Arkadij Semenovi ˇc Nemirovskij and David Borisovich Yudin. Prob- lem Complexity and Method Efficiency in Optimization . Wiley- Interscience, 1983

  25. [25]

    Online convex optimization in the bandit setting: gradient descent without a gradient

    Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM- SIAM Symposium on Discrete Algorithms , pages 385–394, 2005

  26. [26]

    Harnessing smoothness to accelerate distributed optimization

    Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017

  27. [27]

    Achieving geometric convergence for distributed optimization over time-varying graphs

    Angelia Nedi ´c, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization , 27(4):2597–2633, 2017

  28. [28]

    Horn and Charles R

    Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012