Toward Understanding the Impact of Staleness in Distributed Machine Learning

· 2018 · cs.LG · arXiv 1810.03264

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Many distributed machine learning (ML) systems adopt non-synchronous execution to alleviate the network communication bottleneck, resulting in stale parameters that do not reflect the latest updates. Despite much development in large-scale ML, the effects of staleness on learning remain inconclusive, as it is challenging to directly monitor or control staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of stochastic gradient descent in non-convex optimization under staleness, matching the best-known convergence rate of $O(1/\sqrt{T})$.
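To make the notion of "delayed updates" concrete, here is a minimal sketch of gradient descent in which every step uses a gradient evaluated on parameters that are a fixed number of steps old. It is an illustration only, not the paper's experimental setup: the function name `stale_gradient_descent`, the fixed-delay model, the noise-free one-dimensional quadratic objective, and the step size are all assumptions chosen for brevity.

```python
from collections import deque


def stale_gradient_descent(grad, w0, lr, staleness, steps):
    """Run gradient descent where each update uses a gradient computed on
    parameters from `staleness` steps ago (0 means no delay).

    Hypothetical helper for illustration; returns the list of iterates.
    """
    w = w0
    # Bounded buffer of recent parameters; history[0] is the oldest entry,
    # i.e. the parameters from `staleness` steps ago.
    history = deque([w0] * (staleness + 1), maxlen=staleness + 1)
    trajectory = [w0]
    for _ in range(steps):
        stale_w = history[0]          # parameters seen by the delayed worker
        w = w - lr * grad(stale_w)    # apply the stale gradient to current w
        history.append(w)             # oldest entry is dropped automatically
        trajectory.append(w)
    return trajectory


def quadratic_grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (an assumption).
    return w


# Same step size, with and without delay.
fresh = stale_gradient_descent(quadratic_grad, w0=1.0, lr=0.5, staleness=0, steps=50)
stale = stale_gradient_descent(quadratic_grad, w0=1.0, lr=0.5, staleness=4, steps=50)

print("no delay, final |w|:", abs(fresh[-1]))
print("delay 4, largest |w| over last 20 steps:", max(abs(x) for x in stale[-20:]))
```

In this toy setting the undelayed run contracts toward the minimizer, while the delayed run with the same step size oscillates instead of settling. That is one simple way staleness can change convergence behavior; real systems have random, time-varying delays and stochastic gradients, which this sketch does not model.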

representative citing papers

Breaking the Capacity Bottleneck in Model-Heterogeneous Federated Learning via Gradual Model Restoration

cs.DC · 2025-12-05 · unverdicted · novelty 6.0

FedGMR progressively restores sub-model capacity for bandwidth-constrained clients via gradual density increases and mask-aware aggregation, narrowing the gap to full-model federated learning.

HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity

cs.DC · 2026-05-29 · unverdicted · novelty 5.0

HeLoCo corrects misaligned pseudo-gradients in asynchronous low-communication training via outer momentum reference, yielding up to 7.5% better loss at fixed tokens and 22.1% over synchronous under severe heterogeneity.

citing papers explorer

HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity

cs.DC · 2026-05-29 · unverdicted · none · ref 25 · internal anchor
