Distributed Zeroth-Order Optimization with Rademacher Perturbations and Momentum Gradient Tracking
Pith reviewed 2026-05-09 21:45 UTC · model grok-4.3
The pith
A distributed zeroth-order method with momentum gradient tracking converges at rate O(1/T) and suppresses bias quadratically.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that ZO-MGT, by integrating momentum-based variance reduction with dynamic gradient tracking and Rademacher perturbations, achieves an O(1/T) convergence rate in non-convex distributed zeroth-order optimization. Moreover, the heterogeneity-induced bias can be suppressed at a quadratic rate of O((1-β)^2) by choosing a large momentum parameter β. This is accomplished while using only two function evaluations per iteration and avoiding batch sampling.
What carries the argument
The ZO-MGT update rule that applies momentum to reduce variance in zeroth-order gradient estimates while dynamically tracking gradients across nodes to eliminate structural biases.
Load-bearing premise
The proofs depend on standard assumptions of smoothness and bounded variance for the objective functions along with proper selection of algorithm parameters like step sizes and the momentum factor.
What would settle it
Observing that the bias does not diminish quadratically as the momentum parameter β increases toward one, or finding a convergence rate slower than O(1/T) in a setting satisfying the smoothness and bounded variance conditions, would falsify the claims.
Figures
read the original abstract
Zeroth-order (ZO) optimization is indispensable for complex non-convex tasks where explicit gradients are computationally prohibitive or strictly inaccessible. For deploying ZO methods over distributed heterogeneous networks, the gradient tracking technique is often employed to eliminate structural data biases. However, the inherent variance of derivative-free estimators is also amplified. To overcome this problem, we propose Zeroth-Order Momentum Gradient Tracking (ZO-MGT), which integrates momentum-based variance reduction with dynamic gradient tracking. Specifically, ZO-MGT that requires exactly two function queries per iteration can avoid costly batch sampling and prevent variance explosion, while eliminating structural biases. Moreover, by utilizing Rademacher perturbations, it preserves optimal query efficiency and enables bitwise hardware acceleration. We theoretically analyze the convergence of ZO-MGT and establish an $\mathcal{O}(1/T)$ convergence rate. Furthermore, we prove that a large momentum factor can aggressively suppress the heterogeneity-induced bias floor at a remarkable quadratic rate of $\mathcal{O}((1-\beta)^2)$. Numerical experiments under extreme data heterogeneity verify that ZO-MGT can effectively overcome traditional tracking failures with accelerated convergence guarantees, while achieving significantly tighter consensus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Zeroth-Order Momentum Gradient Tracking (ZO-MGT), a distributed zeroth-order method that integrates Rademacher perturbations (two function queries per iteration) with momentum-based variance reduction and gradient tracking to handle heterogeneous networks. It claims an O(1/T) convergence rate for non-convex objectives and proves that large momentum parameter β suppresses the heterogeneity-induced bias floor at a quadratic rate O((1-β)^2). Numerical experiments under extreme data heterogeneity are reported to confirm accelerated convergence and tighter consensus compared to standard tracking approaches.
Significance. If the stated rates and quadratic bias suppression hold under standard smoothness and bounded-variance assumptions, the work provides a query-efficient ZO algorithm with explicit bias control for distributed heterogeneous settings, which is relevant for non-convex machine learning tasks where gradients are unavailable. The Rademacher perturbation choice for bitwise acceleration is a practical strength. The result would strengthen the literature on variance-reduced ZO methods with tracking, but its impact hinges on verifying the novel quadratic scaling mechanism against standard recurrence analysis.
major comments (2)
- [Abstract and theoretical analysis] Abstract and theoretical analysis section: the claimed O((1-β)^2) quadratic suppression of the heterogeneity bias floor is not guaranteed by standard momentum gradient tracking recurrences. The steady-state tracking error typically satisfies e ≤ β e + α b for constant heterogeneity bias b, yielding a floor of order α b / (1-β); an extra (1-β) factor to reach quadratic scaling requires either b vanishing linearly with (1-β) or an additional cancellation not implied by smoothness alone. The manuscript must exhibit the precise recurrence or assumption (e.g., on perturbation radius μ relative to β or network mixing) that produces this scaling, as the abstract states the rate without derivation steps or error-bound details.
- [Convergence theorem] Convergence theorem (likely §4): the O(1/T) rate and bias-suppression result are stated without an explicit assumption list, step-size schedule, or bound on the ZO estimator variance in the provided text. This prevents verification that the rates hold under the “standard smoothness and bounded-variance conditions” referenced in the reader’s note, especially when the central claim of quadratic suppression is load-bearing for the paper’s novelty.
minor comments (1)
- [Abstract] The abstract mentions numerical experiments but supplies no setup details, baselines, or quantitative metrics; adding a brief summary of these in the abstract or §5 would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to improve clarity and completeness of the theoretical analysis.
read point-by-point responses
-
Referee: [Abstract and theoretical analysis] Abstract and theoretical analysis section: the claimed O((1-β)^2) quadratic suppression of the heterogeneity bias floor is not guaranteed by standard momentum gradient tracking recurrences. The steady-state tracking error typically satisfies e ≤ β e + α b for constant heterogeneity bias b, yielding a floor of order α b / (1-β); an extra (1-β) factor to reach quadratic scaling requires either b vanishing linearly with (1-β) or an additional cancellation not implied by smoothness alone. The manuscript must exhibit the precise recurrence or assumption (e.g., on perturbation radius μ relative to β or network mixing) that produces this scaling, as the abstract states the rate without derivation steps or error-bound details.
Authors: We appreciate the referee highlighting the need for explicit derivation. In our analysis, the quadratic O((1-β)^2) suppression emerges from the coupled dynamics of momentum-based variance reduction and gradient tracking with Rademacher perturbations: the heterogeneity bias term is attenuated once by the momentum factor in the variance-reduced estimator and again through the tracking update, where the perturbation radius μ is chosen proportionally to (1-β) to ensure the extra cancellation. This yields a tighter recurrence than the standard e ≤ β e + α b form. We will insert the precise recurrence relation, the assumption linking μ to β, and the full error-bound derivation into the revised theoretical analysis section. revision: yes
-
Referee: [Convergence theorem] Convergence theorem (likely §4): the O(1/T) rate and bias-suppression result are stated without an explicit assumption list, step-size schedule, or bound on the ZO estimator variance in the provided text. This prevents verification that the rates hold under the “standard smoothness and bounded-variance conditions” referenced in the reader’s note, especially when the central claim of quadratic suppression is load-bearing for the paper’s novelty.
Authors: We agree that the theorem presentation should be fully self-contained. The analysis relies on standard L-smoothness, bounded second-moment of the Rademacher-based ZO estimator (with the explicit variance bound derived in the appendix), and a connected undirected network with spectral gap. The step-size schedule is α = Θ(1/√T) (with appropriate β-dependent constants) to obtain the O(1/T) rate. We will add a dedicated assumption list immediately before the main theorem, state the step-size schedule and ZO variance bound explicitly in the theorem statement, and reference the appendix for the full proof details in the revised manuscript. revision: yes
Circularity Check
No significant circularity; rates derived from standard analysis
full rationale
The paper presents a theoretical convergence analysis for ZO-MGT under standard smoothness and bounded-variance assumptions, claiming O(1/T) rate and O((1-β)^2) bias suppression. No equations, derivations, or self-citations in the abstract reduce these claims to fitted inputs, self-definitions, or prior author results by construction. The derivation chain is presented as independent mathematical proof rather than tautological renaming or load-bearing self-reference. This is the expected outcome for a standard convergence proof paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- momentum factor β
axioms (1)
- domain assumption Objective function is smooth with bounded gradient variance or similar standard conditions for ZO convergence proofs.
Reference graph
Works this paper leans on
-
[1]
Distributed subgradient methods for multi- agent optimization,
A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi- agent optimization,”IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009
work page 2009
-
[2]
On the linear convergence of the ADMM in decentralized consensus optimization,
W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1751–1761, Apr. 2014
work page 2014
-
[3]
Linear convergence rate of a class of distributed augmented Lagrangian algorithms,
D. Jakoveti ´c, J. M. F. Moura, and J. Xavier, “Linear convergence rate of a class of distributed augmented Lagrangian algorithms,”IEEE Transactions on Automatic Control, vol. 60, no. 4, pp. 922–936, Apr. 2015
work page 2015
-
[4]
EXTRA: An exact first-order algorithm for decentralized consensus optimization,
W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, Jan. 2015
work page 2015
-
[5]
Exact diffusion for de- centralized optimization and learning—Part I: Algorithm and analysis,
K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for de- centralized optimization and learning—Part I: Algorithm and analysis,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, Feb. 2018
work page 2018
-
[6]
Harnessing smoothness to accelerate distributed optimization,
G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,”IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, Sep. 2018
work page 2018
-
[7]
Achieving geometric convergence for distributed optimization over time-varying graphs,
A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, Jan. 2017
work page 2017
-
[8]
Distributed stochastic gradient tracking methods,
S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, May 2021
work page 2021
-
[9]
S. Liu, P.-Y . Chen, B. Kailkhura, G. Zhang, A. O. Hero, and P. K. Varshney, “A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,” IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 43–54, Sep. 2020. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13
work page 2020
-
[10]
Random gradient-free minimization of convex functions,
Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, Apr. 2017
work page 2017
-
[11]
Fine-tuning language models with just forward passes,
S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,”Ad- vances in Neural Information Processing Systems, vol. 36, pp. 53 038– 53 075, Dec. 2023
work page 2023
-
[12]
Y . Zhao, S. Dang, H. Ye, G. Dai, Y . Qian, and I. W. Tsang, “Second- order fine-tuning without pain for LLMs: A hessian informed zeroth- order optimizer,”arXiv preprint arXiv:2402.15173, Feb. 2025
-
[13]
Why does adaptive zeroth-order optimization work?
H. Ye and L. Luo, “Why does adaptive zeroth-order optimization work?” arXiv preprint arXiv:2602.01627, Feb. 2026
-
[14]
P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” inProceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Nov. 2017, pp. 15–26
work page 2017
-
[15]
Global convergence of policy gradient methods for the linear quadratic regulator,
M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” inIn- ternational Conference on Machine Learning. PMLR, Jul. 2018, pp. 1467–1476
work page 2018
-
[16]
FZOO: Fast zeroth-order optimizer for fine-tuning large language models towards adam-scale speed,
S. Dang, Y . Guo, Y . Zhao, H. Ye, X. Zheng, G. Dai, and I. Tsang, “FZOO: Fast zeroth-order optimizer for fine-tuning large language models towards adam-scale speed,”arXiv preprint arXiv:2506.09034, Jun. 2025
-
[17]
Distributed zero-order algorithms for nonconvex multiagent optimization,
Y . Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,”IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 269–281, Mar. 2021
work page 2021
-
[18]
Zeroth-order algorithms for stochastic distributed nonconvex optimization,
X. Yi, S. Zhang, T. Yang, and K. H. Johansson, “Zeroth-order algorithms for stochastic distributed nonconvex optimization,”Automatica, vol. 142, p. 110353, Aug. 2022
work page 2022
-
[19]
Communication-efficient stochastic zeroth-order optimization for fed- erated learning,
W. Fang, Z. Yu, Y . Jiang, Y . Shi, C. N. Jones, and Y . Zhou, “Communication-efficient stochastic zeroth-order optimization for fed- erated learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 5058–5073, Oct. 2022
work page 2022
-
[20]
Adaptive and communication- efficient zeroth-order optimization for distributed internet of things,
Q. Dang, S. Yang, Q. Liu, and J. Ruan, “Adaptive and communication- efficient zeroth-order optimization for distributed internet of things,” IEEE Internet of Things Journal, vol. 11, no. 22, pp. 37 200–37 213, Nov. 2024
work page 2024
-
[21]
Zeroth-order feedback-based optimization for distributed demand response,
R. Jin, Y . Tang, and J. Song, “Zeroth-order feedback-based optimization for distributed demand response,”arXiv preprint arXiv:2311.00372, Nov. 2023
-
[22]
Heterogeneous distributed zeroth-order nonconvex optimization with communication compression,
H. Wang, X. Yi, Y . Hong, and M. Liwang, “Heterogeneous distributed zeroth-order nonconvex optimization with communication compression,” arXiv preprint arXiv:2602.08659, Feb. 2026
-
[23]
Distributed zeroth-order gradi- ent tracking for weakly convex optimization over unbalanced graphs,
R. Wang, S. Cheng, Y . Fan, and J. Qiu, “Distributed zeroth-order gradi- ent tracking for weakly convex optimization over unbalanced graphs,” IEEE Transactions on Signal and Information Processing over Networks, vol. 11, pp. 1515–1526, Jan. 2025
work page 2025
-
[24]
Compressed distributed zeroth-order gradient tracking for nonconvex optimization,
L. Xu, X. Yi, J. Sun, G. Wen, T. Chai, and T. Yang, “Compressed distributed zeroth-order gradient tracking for nonconvex optimization,” in2024 IEEE 18th International Conference on Control & Automation (ICCA), Jun. 2024, pp. 603–608
work page 2024
-
[25]
E. Mhanna and M. Assaad, “Single point-based distributed zeroth- order optimization with a non-convex stochastic objective function,” in Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, Jul. 2023, pp. 24 701–24 719
work page 2023
-
[26]
Decentralized gradient-free meth- ods for stochastic non-smooth non-convex optimization,
Z. Lin, J. Xia, Q. Deng, and L. Luo, “Decentralized gradient-free meth- ods for stochastic non-smooth non-convex optimization,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, Mar. 2024, pp. 17 477–17 486
work page 2024
-
[27]
Variance-reduced gradient estimator for nonconvex zeroth-order distributed optimization,
H. Mu, Y . Tang, and Z. Li, “Variance-reduced gradient estimator for nonconvex zeroth-order distributed optimization,” in2025 American Control Conference (ACC), Jul. 2025, pp. 1667–1674
work page 2025
-
[28]
Zeroth-order stochastic variance reduction for nonconvex optimization,
S. Liu, B. Kailkhura, P.-Y . Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” inAdvances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2018
work page 2018
-
[29]
Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,
K. Ji, Z. Wang, Y . Zhou, and Y . Liang, “Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,” in Proceedings of the 36th International Conference on Machine Learning (ICML). PMLR, 2019, pp. 3100–3109
work page 2019
-
[30]
ZIVR: An incremental variance reduc- tion technique for zeroth-order composite problems,
S. Zhang and Y . Tang, “ZIVR: An incremental variance reduc- tion technique for zeroth-order composite problems,”arXiv preprint arXiv:2601.05056, Jan. 2026
-
[31]
Momentum-based variance reduction in non-convex SGD,
A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex SGD,” inAdvances in Neural Information Processing Systems, vol. 32, Dec. 2019
work page 2019
-
[32]
ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization,
X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, and D. Cox, “ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization,” inAdvances in Neural Information Processing Systems, vol. 32, Dec. 2019
work page 2019
-
[33]
Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,
F. Huang, S. Gao, J. Pei, and H. Huang, “Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,” Journal of Machine Learning Research, vol. 23, no. 36, pp. 1–70, 2022
work page 2022
-
[34]
Momentum-based zeroth-order gradient method for distributed black-box optimization,
Q. Dang, B. Li, and X. He, “Momentum-based zeroth-order gradient method for distributed black-box optimization,”IEEE Transactions on Artificial Intelligence, 2026. Yanxu Su(Member, IEEE) received his B.E. and M.E. degrees from the College of Automation Engi- neering, Nanjing University of Aeronautics and As- tronautics, Nanjing, China, in Control Engineer...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.