pith. sign in

arxiv: 2604.08236 · v1 · submitted 2026-04-09 · 🧮 math.OC

Improved Convergence for Decentralized Stochastic Optimization with Biased Gradients

Pith reviewed 2026-05-10 16:47 UTC · model grok-4.3

classification 🧮 math.OC
keywords decentralized optimizationstochastic optimizationbiased gradientsmomentum trackingnonconvex optimizationlinear speedupdata heterogeneityconvergence analysis
0
0 comments X

The pith

Biased-DMT achieves linear speedup and decouples network topology from data heterogeneity in nonconvex decentralized stochastic optimization with biased gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Biased-DMT, a decentralized momentum tracking algorithm built to handle biased gradient estimates that arise from compression or inexact local oracles. It proves that this method converges in nonconvex settings while delivering linear speedup with the number of agents and separating the influence of the communication network from data heterogeneity. The analysis shows robust behavior even on sparse networks. With absolute bias the algorithm removes structural heterogeneity error and reaches the exact physical error floor; with relative bias the remaining error stems directly from locally injected noise.

Core claim

We propose Decentralized Momentum Tracking with Biased Gradients (Biased-DMT) and establish its convergence theory for nonconvex stochastic optimization. The algorithm achieves linear speedup with respect to the number of agents, decouples network topology effects from data heterogeneity, and enables reliable performance under biased gradients. When the oracle supplies only absolute bias, Biased-DMT eliminates structural heterogeneity error and converges to the physical error floor; for relative bias the limit is characterized as an unavoidable consequence of local noise.

What carries the argument

Biased-DMT, a decentralized momentum tracking algorithm that maintains consensus on biased gradients and separates network topology from data heterogeneity.

If this is right

  • Linear speedup is obtained with respect to the number of agents in nonconvex settings.
  • Network topology effects are separated from data heterogeneity, so sparse networks remain effective.
  • Absolute bias leads to elimination of structural heterogeneity error and convergence to the exact physical error floor.
  • Relative bias produces a convergence limit set by locally injected noise.
  • The algorithm works reliably under communication compression or inexact local oracles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling result suggests that communication network design can be chosen independently of data partitioning in large-scale systems.
  • The method could be tested on time-varying or directed graphs to see whether the same separation of topology and heterogeneity persists.
  • Practical implementations with gradient compression could use the absolute-bias case to reach the noise floor without additional heterogeneity penalties.

Load-bearing premise

The convergence claims rest on standard nonconvex smoothness and variance bounds together with the assumption that gradient bias is bounded in either absolute or relative form.

What would settle it

Measure whether the observed convergence rate scales linearly with the number of agents in experiments that increase the agent count while holding bias fixed, or check if the final error matches the predicted physical floor exactly when absolute bias is the only source of error.

Figures

Figures reproduced from arXiv: 2604.08236 by Qing Xu, Songyi Dian, Wenqi Fan, Xingxing You, Yiwei Liao.

Figure 1
Figure 1. Figure 1: Convergence comparison under biased gradient oracle and Non-IID [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Convergence of Biased-DMT under varying gradient bias magnitudes. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Decentralized stochastic optimization has emerged as a fundamental paradigm for large-scale machine learning. However, practical implementations often rely on biased gradient estimators arising from communication compression or inexact local oracles, which severely degrade convergence in the presence of data heterogeneity. To address the challenge, we propose Decentralized Momentum Tracking with Biased Gradients (Biased-DMT), a novel decentralized algorithm designed to operate reliably under biased gradient information. We establish a comprehensive convergence theory for Biased-DMT in nonconvex settings and show that it achieves linear speedup with respect to the number of agents. The theoretical analysis shows that Biased-DMT decouples the effects of network topology from data heterogeneity, enabling robust performance even in sparse communication networks. Notably, when the gradient oracle introduces only absolute bias, the proposed method eliminates the structural heterogeneity error and converges to the exact physical error floor. For the case of relative bias, we further characterize the convergence limit and show that the remaining error is an unavoidable physical consequence of locally injected noise. Extensive numerical experiments corroborate our theoretical analysis and demonstrate the practical effectiveness of Biased-DMT across a range of decentralized learning scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Decentralized Momentum Tracking with Biased Gradients (Biased-DMT) for nonconvex decentralized stochastic optimization under biased gradient oracles. It claims a comprehensive convergence theory establishing linear speedup in the number of agents, decoupling of network topology effects from data heterogeneity (enabling robustness on sparse graphs), elimination of structural heterogeneity error under absolute bias (converging to the exact physical error floor), and characterization of an unavoidable noise floor under relative bias. Numerical experiments are presented to corroborate the theory.

Significance. If the decoupling and bias-dependent error elimination hold under standard nonconvex assumptions, the result would be significant for decentralized ML, as it directly addresses degradation from practical biased estimators (compression, inexact oracles) while preserving linear speedup and sparse-network performance. The explicit separation of absolute vs. relative bias effects and the momentum-tracking mechanism for consensus-error cancellation would be a useful addition to the literature on compressed or inexact decentralized optimization.

major comments (2)
  1. [Theoretical analysis (convergence bounds and Lyapunov function)] The central decoupling claim (network topology independent of data heterogeneity) and the elimination of the structural heterogeneity term under absolute bias require that the bias bound does not interact with local gradients or consensus errors in the Lyapunov analysis or variance decomposition. The descent lemma must be checked to confirm that momentum tracking cancels the term independently of graph Laplacian eigenvalues/spectral gap, without residual factors when the absolute bias bound is agent-dependent or gradient-norm correlated.
  2. [Convergence theorem statements] The linear speedup w.r.t. number of agents and sparse-network robustness rest on the above decoupling; if the proof retains a mixing-rate factor under the stated bias models, the claims on robust performance in sparse communication networks would be weakened. Explicit constants for smoothness, variance, and connectivity assumptions should be stated to allow verification of the rates.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction would benefit from a brief statement of the precise assumptions (L-smoothness, bounded variance, network connectivity) and the exact form of the absolute/relative bias models to make the claims immediately checkable.
  2. [Numerical experiments section] Figure captions and experiment details (network topologies, bias magnitudes, baselines) should be expanded for reproducibility; current description is high-level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive suggestions on the theoretical analysis. We have revised the manuscript to address the points raised, including clarifications to the Lyapunov analysis and explicit statement of constants in the convergence theorems.

read point-by-point responses
  1. Referee: [Theoretical analysis (convergence bounds and Lyapunov function)] The central decoupling claim (network topology independent of data heterogeneity) and the elimination of the structural heterogeneity term under absolute bias require that the bias bound does not interact with local gradients or consensus errors in the Lyapunov analysis or variance decomposition. The descent lemma must be checked to confirm that momentum tracking cancels the term independently of graph Laplacian eigenvalues/spectral gap, without residual factors when the absolute bias bound is agent-dependent or gradient-norm correlated.

    Authors: We appreciate this careful scrutiny of the proof structure. Under the absolute bias model (Assumption 3 in the manuscript), the bias is bounded by a constant B independent of the local gradient norm and consensus error; this allows the bias term to be pulled out of the variance decomposition without introducing gradient-dependent factors. The momentum tracking update is designed to cancel the consensus error in the Lyapunov function V_t = f(x_t) + (1/2) ||x_t - bar x_t||^2 + momentum terms, and the descent lemma is applied after this cancellation. The resulting bound on the heterogeneity term is independent of the Laplacian eigenvalues because the tracking mechanism absorbs the mixing rate into a geometric contraction that does not multiply the bias. For agent-dependent biases we retain the uniform bound sup_i B_i <= B as stated; no gradient-norm correlation appears in the absolute-bias case. We have added an expanded step-by-step derivation of the descent lemma and variance decomposition in the revised appendix (new Section C.3) to make this independence explicit. revision: yes

  2. Referee: [Convergence theorem statements] The linear speedup w.r.t. number of agents and sparse-network robustness rest on the above decoupling; if the proof retains a mixing-rate factor under the stated bias models, the claims on robust performance in sparse communication networks would be weakened. Explicit constants for smoothness, variance, and connectivity assumptions should be stated to allow verification of the rates.

    Authors: The linear speedup O(1/sqrt(m T)) and the decoupling from network topology follow directly from the bias-independent cancellation shown above; the final rate contains no residual spectral-gap factor multiplying the heterogeneity or bias terms. The only connectivity dependence appears in the transient consensus term, which is controlled by the momentum parameter and vanishes at the same rate as the optimization error. We have revised Theorem 1 and Theorem 2 to display all constants explicitly: smoothness L, variance bound sigma^2, absolute-bias bound B, number of agents m, and the graph connectivity parameter chi (defined as the maximum eigenvalue ratio of the mixing matrix). These appear both in the main theorem statements and in the corollary for sparse graphs (Corollary 1). revision: yes

Circularity Check

0 steps flagged

No circularity detected; theoretical claims lack visible self-referential reductions or fitted inputs

full rationale

The provided abstract and context contain no equations, derivations, or explicit self-citations that reduce claims to inputs by construction. Claims of decoupling network topology from data heterogeneity and error elimination under absolute bias are presented as outcomes of the analysis, but without quotable steps (e.g., no descent lemmas, Lyapunov functions, or parameter fits shown), none of the enumerated circularity patterns apply. The reader's note confirms no equations are visible, and skeptic concerns address potential proof gaps rather than definitional circularity. This is the standard honest non-finding for abstract-only or equation-free presentations; the derivation is treated as self-contained pending full text inspection.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The paper rests on standard domain assumptions for nonconvex stochastic decentralized optimization plus bounded bias models; no free parameters or new entities are mentioned in the abstract.

axioms (3)
  • domain assumption Objective functions are nonconvex and L-smooth
    Required for the stated nonconvex convergence theory.
  • domain assumption Gradient oracles have bounded absolute or relative bias
    Central to the bias-handling and error-floor claims.
  • domain assumption Stochastic gradients have bounded variance and the communication network is connected
    Standard for decentralized optimization analysis and linear speedup.

pith-pipeline@v0.9.0 · 5504 in / 1337 out tokens · 58248 ms · 2026-05-10T16:47:45.315650+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  1. [1]

    Distributed subgradient methods for multi- agent optimization,

    A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi- agent optimization,”IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009

  2. [2]

    On the convergence of decentralized gradient descent,

    K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,”SIAM J. Optim., vol. 26, no. 3, pp. 1835–1854, 2016

  3. [3]

    Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,

    X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 30, 2017

  4. [4]

    D 2: Decentralized training over decentralized data,

    H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D 2: Decentralized training over decentralized data,” inInt. Conf. Mach. Learn. (ICML), pp. 4848–4856, 2018

  5. [5]

    An improved convergence analysis for decentralized online stochastic nonconvex optimization,

    R. Xin, U. A. Khan, and S. Kar, “An improved convergence analysis for decentralized online stochastic nonconvex optimization,”IEEE Trans. Signal Process., vol. 69, pp. 1842–1858, 2020

  6. [6]

    Distributed stochastic gradient tracking methods,

    S. Pu and A. Nedic, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, no. 1, pp. 409–457, 2021

  7. [7]

    GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,

    S. Lu, X. Zhang, H. Sun, and M. Hong, “GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,” inIEEE Data Sci. Workshop (DSW), pp. 315–321, 2019

  8. [8]

    On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization,

    H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization,” inInt. Conf. Mach. Learn. (ICML), pp. 7184–7193, 2019

  9. [9]

    Momentum tracking: Momentum acceleration for decentralized deep learning on heterogeneous data,

    Y . Takezawa, H. Bao, K. Niwa, R. Sato, and M. Yamada, “Momentum tracking: Momentum acceleration for decentralized deep learning on heterogeneous data,” inTrans. Mach. Learn. Res., 2023

  10. [10]

    STEM: A stochastic two-sided momentum algorithm minimizing a smooth nonconvex objective,

    P. Khanduri, P. Sharma, H. Yang, M. Hong, J. Liu, K. Rajawat, and P. Varshney, “STEM: A stochastic two-sided momentum algorithm minimizing a smooth nonconvex objective,” inInt. Conf. Artif. Intell. Stat. (AISTATS), pp. 3376–3384, 2021

  11. [11]

    Towards faster decentralized stochastic optimization with communication compression,

    R. Islamov, Y . Gao, and S. U. Stich, “Towards faster decentralized stochastic optimization with communication compression,” inInt. Conf. Learn. Represent. (ICLR), 2025

  12. [12]

    Distributed stochastic momentum tr acking with local updates: Achieving optimal communication and iterat ion complex- ities,

    K. Huang and S. Pu, “Distributed stochastic momentum tracking with local updates: Achieving optimal communication and iteration complex- ities,”arXiv preprint arXiv:2510.24155, 2025

  13. [13]

    Decentralized deep learning with arbitrary communication compression,

    A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi, “Decentralized deep learning with arbitrary communication compression,” inInt. Conf. Learn. Represent. (ICLR), 2020

  14. [14]

    On biased compression for distributed learning

    A. Beznosikov, S. Horv ´ath, P. Richt ´arik, and M. Safaryan, “On biased compression for distributed learning,”arXiv preprint arXiv:2002.12410, 2020

  15. [15]

    A compressed gradient tracking method for decentralized optimization with linear convergence,

    Y . Liao, Z. Li, K. Huang, and S. Pu, “A compressed gradient tracking method for decentralized optimization with linear convergence,”IEEE Trans. Autom. Control, vol. 68, no. 10, pp. 6279–6286, 2023

  16. [16]

    A linearly convergent robust compressed push-pull method for decentralized optimization,

    Y . Liao, Z. Li, and S. Pu, “A linearly convergent robust compressed push-pull method for decentralized optimization,” inProc. IEEE 62nd Conf. Decis. Control (CDC), pp. 4156–4161, 2023

  17. [17]

    Distributed zeroth-order gradient tracking for weakly convex optimization over unbalanced graphs,

    X. Jiang, X. Zeng, J. Sun, and J. Chen, “Distributed zeroth-order gradient tracking for weakly convex optimization over unbalanced graphs,”IEEE Trans. Autom. Control, vol. 69, no. 3, pp. 1664–1679, 2024

  18. [18]

    On the convergence of decen- tralized stochastic gradient descent with biased gradients,

    Y . Jiang, H. Kang, J. Liu, and D. Xu, “On the convergence of decen- tralized stochastic gradient descent with biased gradients,”IEEE Trans. Signal Process., vol. 73, pp. 205–218, 2025

  19. [19]

    Analysis of SGD with biased gradient estimators,

    A. Ajalloeian and S. U. Stich, “Analysis of SGD with biased gradient estimators,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, 2020

  20. [20]

    Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data,

    T. Lin, S. P. Karimireddy, S. U. Stich, and M. Jaggi, “Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data,” inInt. Conf. Mach. Learn. (ICML), pp. 6661–6671, 2021

  21. [21]

    Momentum-based variance reduction in nonconvex SGD,

    A. Cutkosky and F. Orabona, “Momentum-based variance reduction in nonconvex SGD,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019

  22. [22]

    Communication-efficient distributed optimization in networks with gradient tracking and variance reduction,

    B. Li, S. Cen, Y . Chen, and Y . Chi, “Communication-efficient distributed optimization in networks with gradient tracking and variance reduction,” inInt. Conf. Artif. Intell. Stat. (AISTATS), pp. 1662–1684, 2020

  23. [23]

    Linear conver- gence of first- and zeroth-order primal-dual algorithms for distributed nonconvex optimization,

    X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson, “Linear conver- gence of first- and zeroth-order primal-dual algorithms for distributed nonconvex optimization,”IEEE Trans. Autom. Control, vol. 68, no. 7, pp. 4001–4016, 2023