Streaming diloco with overlapping communication: Towards a distributed free lunch

Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey , Ross McIlroy , Jiajun Shen, et al · 2025 · arXiv 2501.18512

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

representative citing papers

Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.

ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.

citing papers explorer

Showing 3 of 3 citing papers.

Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo cs.LG · 2026-05-09 · unverdicted · none · ref 3
CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism cs.LG · 2026-04-13 · unverdicted · none · ref 1
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network cs.LG · 2025-06-02 · unverdicted · none · ref 9
TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.

Streaming diloco with overlapping communication: Towards a distributed free lunch

fields

years

verdicts

representative citing papers

citing papers explorer