CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
Streaming diloco with overlapping communication: Towards a distributed free lunch
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3verdicts
UNVERDICTED 3representative citing papers
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.
citing papers explorer
-
Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo
CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
-
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
-
TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.