pith. sign in

arxiv: 2604.21368 · v1 · submitted 2026-04-23 · 🧮 math.OC

Distributed Zeroth-Order Optimization with Rademacher Perturbations and Momentum Gradient Tracking

Pith reviewed 2026-05-09 21:45 UTC · model grok-4.3

classification 🧮 math.OC
keywords zeroth-order optimizationdistributed optimizationgradient trackingmomentumRademacher perturbationsnon-convex problemsheterogeneous datavariance reduction
0
0 comments X

The pith

A distributed zeroth-order method with momentum gradient tracking converges at rate O(1/T) and suppresses bias quadratically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes Zeroth-Order Momentum Gradient Tracking (ZO-MGT) to solve non-convex distributed optimization problems without access to explicit gradients. The approach combines momentum for variance reduction with gradient tracking to counter data heterogeneity while keeping variance in check. It performs the task with precisely two function queries each iteration using Rademacher perturbations. Analysis proves an overall O(1/T) convergence rate and shows that larger momentum factors reduce the bias floor at a quadratic rate O((1-β)^2). Tests on highly heterogeneous data confirm better performance than prior tracking methods.

Core claim

The central discovery is that ZO-MGT, by integrating momentum-based variance reduction with dynamic gradient tracking and Rademacher perturbations, achieves an O(1/T) convergence rate in non-convex distributed zeroth-order optimization. Moreover, the heterogeneity-induced bias can be suppressed at a quadratic rate of O((1-β)^2) by choosing a large momentum parameter β. This is accomplished while using only two function evaluations per iteration and avoiding batch sampling.

What carries the argument

The ZO-MGT update rule that applies momentum to reduce variance in zeroth-order gradient estimates while dynamically tracking gradients across nodes to eliminate structural biases.

Load-bearing premise

The proofs depend on standard assumptions of smoothness and bounded variance for the objective functions along with proper selection of algorithm parameters like step sizes and the momentum factor.

What would settle it

Observing that the bias does not diminish quadratically as the momentum parameter β increases toward one, or finding a convergence rate slower than O(1/T) in a setting satisfying the smoothness and bounded variance conditions, would falsify the claims.

Figures

Figures reproduced from arXiv: 2604.21368 by Changyin Sun, Xiaorui Tong, Yanxu Su.

Figure 1
Figure 1. Figure 1: Convergence evaluation with respect to iterations. [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Sensitivity analysis of the momentum factor [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
read the original abstract

Zeroth-order (ZO) optimization is indispensable for complex non-convex tasks where explicit gradients are computationally prohibitive or strictly inaccessible. For deploying ZO methods over distributed heterogeneous networks, the gradient tracking technique is often employed to eliminate structural data biases. However, the inherent variance of derivative-free estimators is also amplified. To overcome this problem, we propose Zeroth-Order Momentum Gradient Tracking (ZO-MGT), which integrates momentum-based variance reduction with dynamic gradient tracking. Specifically, ZO-MGT that requires exactly two function queries per iteration can avoid costly batch sampling and prevent variance explosion, while eliminating structural biases. Moreover, by utilizing Rademacher perturbations, it preserves optimal query efficiency and enables bitwise hardware acceleration. We theoretically analyze the convergence of ZO-MGT and establish an $\mathcal{O}(1/T)$ convergence rate. Furthermore, we prove that a large momentum factor can aggressively suppress the heterogeneity-induced bias floor at a remarkable quadratic rate of $\mathcal{O}((1-\beta)^2)$. Numerical experiments under extreme data heterogeneity verify that ZO-MGT can effectively overcome traditional tracking failures with accelerated convergence guarantees, while achieving significantly tighter consensus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Zeroth-Order Momentum Gradient Tracking (ZO-MGT), a distributed zeroth-order method that integrates Rademacher perturbations (two function queries per iteration) with momentum-based variance reduction and gradient tracking to handle heterogeneous networks. It claims an O(1/T) convergence rate for non-convex objectives and proves that large momentum parameter β suppresses the heterogeneity-induced bias floor at a quadratic rate O((1-β)^2). Numerical experiments under extreme data heterogeneity are reported to confirm accelerated convergence and tighter consensus compared to standard tracking approaches.

Significance. If the stated rates and quadratic bias suppression hold under standard smoothness and bounded-variance assumptions, the work provides a query-efficient ZO algorithm with explicit bias control for distributed heterogeneous settings, which is relevant for non-convex machine learning tasks where gradients are unavailable. The Rademacher perturbation choice for bitwise acceleration is a practical strength. The result would strengthen the literature on variance-reduced ZO methods with tracking, but its impact hinges on verifying the novel quadratic scaling mechanism against standard recurrence analysis.

major comments (2)
  1. [Abstract and theoretical analysis] Abstract and theoretical analysis section: the claimed O((1-β)^2) quadratic suppression of the heterogeneity bias floor is not guaranteed by standard momentum gradient tracking recurrences. The steady-state tracking error typically satisfies e ≤ β e + α b for constant heterogeneity bias b, yielding a floor of order α b / (1-β); an extra (1-β) factor to reach quadratic scaling requires either b vanishing linearly with (1-β) or an additional cancellation not implied by smoothness alone. The manuscript must exhibit the precise recurrence or assumption (e.g., on perturbation radius μ relative to β or network mixing) that produces this scaling, as the abstract states the rate without derivation steps or error-bound details.
  2. [Convergence theorem] Convergence theorem (likely §4): the O(1/T) rate and bias-suppression result are stated without an explicit assumption list, step-size schedule, or bound on the ZO estimator variance in the provided text. This prevents verification that the rates hold under the “standard smoothness and bounded-variance conditions” referenced in the reader’s note, especially when the central claim of quadratic suppression is load-bearing for the paper’s novelty.
minor comments (1)
  1. [Abstract] The abstract mentions numerical experiments but supplies no setup details, baselines, or quantitative metrics; adding a brief summary of these in the abstract or §5 would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to improve clarity and completeness of the theoretical analysis.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis] Abstract and theoretical analysis section: the claimed O((1-β)^2) quadratic suppression of the heterogeneity bias floor is not guaranteed by standard momentum gradient tracking recurrences. The steady-state tracking error typically satisfies e ≤ β e + α b for constant heterogeneity bias b, yielding a floor of order α b / (1-β); an extra (1-β) factor to reach quadratic scaling requires either b vanishing linearly with (1-β) or an additional cancellation not implied by smoothness alone. The manuscript must exhibit the precise recurrence or assumption (e.g., on perturbation radius μ relative to β or network mixing) that produces this scaling, as the abstract states the rate without derivation steps or error-bound details.

    Authors: We appreciate the referee highlighting the need for explicit derivation. In our analysis, the quadratic O((1-β)^2) suppression emerges from the coupled dynamics of momentum-based variance reduction and gradient tracking with Rademacher perturbations: the heterogeneity bias term is attenuated once by the momentum factor in the variance-reduced estimator and again through the tracking update, where the perturbation radius μ is chosen proportionally to (1-β) to ensure the extra cancellation. This yields a tighter recurrence than the standard e ≤ β e + α b form. We will insert the precise recurrence relation, the assumption linking μ to β, and the full error-bound derivation into the revised theoretical analysis section. revision: yes

  2. Referee: [Convergence theorem] Convergence theorem (likely §4): the O(1/T) rate and bias-suppression result are stated without an explicit assumption list, step-size schedule, or bound on the ZO estimator variance in the provided text. This prevents verification that the rates hold under the “standard smoothness and bounded-variance conditions” referenced in the reader’s note, especially when the central claim of quadratic suppression is load-bearing for the paper’s novelty.

    Authors: We agree that the theorem presentation should be fully self-contained. The analysis relies on standard L-smoothness, bounded second-moment of the Rademacher-based ZO estimator (with the explicit variance bound derived in the appendix), and a connected undirected network with spectral gap. The step-size schedule is α = Θ(1/√T) (with appropriate β-dependent constants) to obtain the O(1/T) rate. We will add a dedicated assumption list immediately before the main theorem, state the step-size schedule and ZO variance bound explicitly in the theorem statement, and reference the appendix for the full proof details in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; rates derived from standard analysis

full rationale

The paper presents a theoretical convergence analysis for ZO-MGT under standard smoothness and bounded-variance assumptions, claiming O(1/T) rate and O((1-β)^2) bias suppression. No equations, derivations, or self-citations in the abstract reduce these claims to fitted inputs, self-definitions, or prior author results by construction. The derivation chain is presented as independent mathematical proof rather than tautological renaming or load-bearing self-reference. This is the expected outcome for a standard convergence proof paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method implicitly relies on standard non-convex optimization assumptions (smoothness, bounded variance) that are not enumerated here.

free parameters (1)
  • momentum factor β
    Tunable parameter whose large value is claimed to drive quadratic bias suppression; its specific value is chosen by the user.
axioms (1)
  • domain assumption Objective function is smooth with bounded gradient variance or similar standard conditions for ZO convergence proofs.
    Required for any O(1/T) rate in non-convex zeroth-order optimization; invoked implicitly by the stated analysis.

pith-pipeline@v0.9.0 · 5503 in / 1330 out tokens · 38384 ms · 2026-05-09T21:45:48.688745+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    Distributed subgradient methods for multi- agent optimization,

    A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi- agent optimization,”IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009

  2. [2]

    On the linear convergence of the ADMM in decentralized consensus optimization,

    W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1751–1761, Apr. 2014

  3. [3]

    Linear convergence rate of a class of distributed augmented Lagrangian algorithms,

    D. Jakoveti ´c, J. M. F. Moura, and J. Xavier, “Linear convergence rate of a class of distributed augmented Lagrangian algorithms,”IEEE Transactions on Automatic Control, vol. 60, no. 4, pp. 922–936, Apr. 2015

  4. [4]

    EXTRA: An exact first-order algorithm for decentralized consensus optimization,

    W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, Jan. 2015

  5. [5]

    Exact diffusion for de- centralized optimization and learning—Part I: Algorithm and analysis,

    K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for de- centralized optimization and learning—Part I: Algorithm and analysis,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, Feb. 2018

  6. [6]

    Harnessing smoothness to accelerate distributed optimization,

    G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,”IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, Sep. 2018

  7. [7]

    Achieving geometric convergence for distributed optimization over time-varying graphs,

    A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, Jan. 2017

  8. [8]

    Distributed stochastic gradient tracking methods,

    S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, May 2021

  9. [9]

    A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,

    S. Liu, P.-Y . Chen, B. Kailkhura, G. Zhang, A. O. Hero, and P. K. Varshney, “A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,” IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 43–54, Sep. 2020. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13

  10. [10]

    Random gradient-free minimization of convex functions,

    Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, Apr. 2017

  11. [11]

    Fine-tuning language models with just forward passes,

    S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,”Ad- vances in Neural Information Processing Systems, vol. 36, pp. 53 038– 53 075, Dec. 2023

  12. [12]

    Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer.arXiv preprint arXiv:2402.15173,

    Y . Zhao, S. Dang, H. Ye, G. Dai, Y . Qian, and I. W. Tsang, “Second- order fine-tuning without pain for LLMs: A hessian informed zeroth- order optimizer,”arXiv preprint arXiv:2402.15173, Feb. 2025

  13. [13]

    Why does adaptive zeroth-order optimization work?

    H. Ye and L. Luo, “Why does adaptive zeroth-order optimization work?” arXiv preprint arXiv:2602.01627, Feb. 2026

  14. [14]

    ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,

    P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” inProceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Nov. 2017, pp. 15–26

  15. [15]

    Global convergence of policy gradient methods for the linear quadratic regulator,

    M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” inIn- ternational Conference on Machine Learning. PMLR, Jul. 2018, pp. 1467–1476

  16. [16]

    FZOO: Fast zeroth-order optimizer for fine-tuning large language models towards adam-scale speed,

    S. Dang, Y . Guo, Y . Zhao, H. Ye, X. Zheng, G. Dai, and I. Tsang, “FZOO: Fast zeroth-order optimizer for fine-tuning large language models towards adam-scale speed,”arXiv preprint arXiv:2506.09034, Jun. 2025

  17. [17]

    Distributed zero-order algorithms for nonconvex multiagent optimization,

    Y . Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,”IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 269–281, Mar. 2021

  18. [18]

    Zeroth-order algorithms for stochastic distributed nonconvex optimization,

    X. Yi, S. Zhang, T. Yang, and K. H. Johansson, “Zeroth-order algorithms for stochastic distributed nonconvex optimization,”Automatica, vol. 142, p. 110353, Aug. 2022

  19. [19]

    Communication-efficient stochastic zeroth-order optimization for fed- erated learning,

    W. Fang, Z. Yu, Y . Jiang, Y . Shi, C. N. Jones, and Y . Zhou, “Communication-efficient stochastic zeroth-order optimization for fed- erated learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 5058–5073, Oct. 2022

  20. [20]

    Adaptive and communication- efficient zeroth-order optimization for distributed internet of things,

    Q. Dang, S. Yang, Q. Liu, and J. Ruan, “Adaptive and communication- efficient zeroth-order optimization for distributed internet of things,” IEEE Internet of Things Journal, vol. 11, no. 22, pp. 37 200–37 213, Nov. 2024

  21. [21]

    Zeroth-order feedback-based optimization for distributed demand response,

    R. Jin, Y . Tang, and J. Song, “Zeroth-order feedback-based optimization for distributed demand response,”arXiv preprint arXiv:2311.00372, Nov. 2023

  22. [22]

    Heterogeneous distributed zeroth-order nonconvex optimization with communication compression,

    H. Wang, X. Yi, Y . Hong, and M. Liwang, “Heterogeneous distributed zeroth-order nonconvex optimization with communication compression,” arXiv preprint arXiv:2602.08659, Feb. 2026

  23. [23]

    Distributed zeroth-order gradi- ent tracking for weakly convex optimization over unbalanced graphs,

    R. Wang, S. Cheng, Y . Fan, and J. Qiu, “Distributed zeroth-order gradi- ent tracking for weakly convex optimization over unbalanced graphs,” IEEE Transactions on Signal and Information Processing over Networks, vol. 11, pp. 1515–1526, Jan. 2025

  24. [24]

    Compressed distributed zeroth-order gradient tracking for nonconvex optimization,

    L. Xu, X. Yi, J. Sun, G. Wen, T. Chai, and T. Yang, “Compressed distributed zeroth-order gradient tracking for nonconvex optimization,” in2024 IEEE 18th International Conference on Control & Automation (ICCA), Jun. 2024, pp. 603–608

  25. [25]

    Single point-based distributed zeroth- order optimization with a non-convex stochastic objective function,

    E. Mhanna and M. Assaad, “Single point-based distributed zeroth- order optimization with a non-convex stochastic objective function,” in Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, Jul. 2023, pp. 24 701–24 719

  26. [26]

    Decentralized gradient-free meth- ods for stochastic non-smooth non-convex optimization,

    Z. Lin, J. Xia, Q. Deng, and L. Luo, “Decentralized gradient-free meth- ods for stochastic non-smooth non-convex optimization,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, Mar. 2024, pp. 17 477–17 486

  27. [27]

    Variance-reduced gradient estimator for nonconvex zeroth-order distributed optimization,

    H. Mu, Y . Tang, and Z. Li, “Variance-reduced gradient estimator for nonconvex zeroth-order distributed optimization,” in2025 American Control Conference (ACC), Jul. 2025, pp. 1667–1674

  28. [28]

    Zeroth-order stochastic variance reduction for nonconvex optimization,

    S. Liu, B. Kailkhura, P.-Y . Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” inAdvances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2018

  29. [29]

    Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,

    K. Ji, Z. Wang, Y . Zhou, and Y . Liang, “Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,” in Proceedings of the 36th International Conference on Machine Learning (ICML). PMLR, 2019, pp. 3100–3109

  30. [30]

    ZIVR: An incremental variance reduc- tion technique for zeroth-order composite problems,

    S. Zhang and Y . Tang, “ZIVR: An incremental variance reduc- tion technique for zeroth-order composite problems,”arXiv preprint arXiv:2601.05056, Jan. 2026

  31. [31]

    Momentum-based variance reduction in non-convex SGD,

    A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex SGD,” inAdvances in Neural Information Processing Systems, vol. 32, Dec. 2019

  32. [32]

    ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization,

    X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, and D. Cox, “ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization,” inAdvances in Neural Information Processing Systems, vol. 32, Dec. 2019

  33. [33]

    Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,

    F. Huang, S. Gao, J. Pei, and H. Huang, “Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,” Journal of Machine Learning Research, vol. 23, no. 36, pp. 1–70, 2022

  34. [34]

    Momentum-based zeroth-order gradient method for distributed black-box optimization,

    Q. Dang, B. Li, and X. He, “Momentum-based zeroth-order gradient method for distributed black-box optimization,”IEEE Transactions on Artificial Intelligence, 2026. Yanxu Su(Member, IEEE) received his B.E. and M.E. degrees from the College of Automation Engi- neering, Nanjing University of Aeronautics and As- tronautics, Nanjing, China, in Control Engineer...