CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
Streaming DiLoCo with overlapping communication: Towards a distributed free lunch
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5
representative citing papers
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.
HeLoCo corrects misaligned pseudo-gradients in asynchronous low-communication training via outer momentum reference, yielding up to 7.5% better loss at fixed tokens and 22.1% over synchronous under severe heterogeneity.
Periodic outer-momentum restarts in two-phase optimizers exploit phase cancellation in a linearized NTK model to widen stable learning-rate and momentum ranges in language-model pretraining.
citing papers explorer
- HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity