Streaming DiLoCo with overlapping communication: Towards a distributed free lunch

doi: 10 · 2025 · arXiv 2501.18512

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.

ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.

HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity

cs.DC · 2026-05-29 · unverdicted · novelty 5.0

HeLoCo corrects misaligned pseudo-gradients in asynchronous low-communication training via outer momentum reference, yielding up to 7.5% better loss at fixed tokens and 22.1% over synchronous under severe heterogeneity.

Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

cs.LG · 2026-05-27 · unverdicted · novelty 5.0

Periodic outer-momentum restarts in two-phase optimizers exploit phase cancellation in a linearized NTK model to widen stable learning-rate and momentum ranges in language-model pretraining.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo
cs.LG · 2026-05-09 · unverdicted · none · ref 3
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
cs.LG · 2026-04-13 · unverdicted · none · ref 1
TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
cs.LG · 2025-06-02 · unverdicted · none · ref 9
Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization
cs.LG · 2026-05-27 · unverdicted · none · ref 10
