MuLoCo: Muon is a practical inner optimizer for DiLoCo

Aaron Defazio; Benjamin Thérien; Eugene Belilovsky; Irina Rish; Xiaolong Huang

arxiv 2505.23725 v3 pith:CCU6BTQW submitted 2025-05-29 cs.LG

Benjamin Thérien, Xiaolong Huang, Aaron Defazio, Irina Rish, Eugene Belilovsky

classification cs.LG

keywords diloco, muloco, muon, optimizer, find, while, workers, adamw

DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and increased accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers ($K$) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data-parallel (DP) training, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as the number of workers ($K$) increases. In our experiments pre-training language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with $K\geq1$ workers, MuLoCo (Muon inner optimizer DiLoCo) achieves superior performance to DiLoCo in absolute terms, and for $K>2$ it outperforms DiLoCo relative to their data-parallel baselines, while being compatible with quantization, streaming, and long synchronization intervals. At $K=1$, we find that MuLoCo can even outperform the data-parallel gold standard while having larger critical batch sizes. Finally, we extrapolate optimal hyperparameters to 15B scale and train a model with each method (six in total) using $K=1$ and $K=16$ workers. We find that $K=16$ MuLoCo nearly matches single-worker performance at this scale, and MuLoCo $K=1$ matches the best-performing baseline while using a much larger 16M-token batch size.
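As a rough illustration of the terms used in the abstract, the sketch below runs one synchronization round on a toy quadratic loss: each of $K$ workers takes a few local (inner) steps, and the pseudogradient is the difference between the shared parameters and the average of the workers' results. Everything here is an assumption made for illustration and is not taken from the paper: the toy loss, the function names, the plain SGD outer step, the absence of momentum, and the use of an exact SVD to normalize the update (Muon itself approximates this orthogonalization with a Newton-Schulz iteration).

```python
import numpy as np

def orthogonalize(m):
    """Return U @ V^T from the SVD of m (all singular values set to 1).
    Assumption: exact SVD stands in for Muon's Newton-Schulz approximation."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def inner_steps(theta, target, steps, lr, normalize):
    """Run `steps` local updates on the toy loss 0.5 * ||theta - target||_F^2."""
    theta = theta.copy()
    for _ in range(steps):
        grad = theta - target  # gradient of the toy loss
        update = orthogonalize(grad) if normalize else grad
        theta -= lr * update
    return theta

def outer_round(theta_global, targets, steps, inner_lr, outer_lr, normalize):
    """One synchronization round over K = len(targets) hypothetical workers."""
    local = [inner_steps(theta_global, t, steps, inner_lr, normalize)
             for t in targets]
    # Pseudogradient: shared parameters minus the average of the local results.
    pseudograd = theta_global - np.mean(local, axis=0)
    # Assumption: plain SGD outer step; with outer_lr = 1 this is simple averaging.
    return theta_global - outer_lr * pseudograd, pseudograd

rng = np.random.default_rng(0)
theta0 = rng.normal(size=(4, 3))
targets = [rng.normal(size=(4, 3)) for _ in range(4)]  # K = 4 toy workers
theta1, pseudograd = outer_round(theta0, targets, steps=5, inner_lr=0.1,
                                 outer_lr=1.0, normalize=True)
```

Setting `normalize=False` gives a plain-gradient inner step for comparison; swapping in AdamW or a full Muon implementation with momentum would be the natural next step but is left out to keep the sketch short.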

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unifying Local Communications and Local Updates for LLM Pretraining
cs.LG 2026-06 unverdicted novelty 6.0

GASLoC generalizes communication acceleration to the outer optimizer to enable gossip-based decentralized LLM pretraining that supports adaptive optimizers, local steps, and outperforms prior decentralized methods on ...

Decoupled DiLoCo for Resilient Distributed Pre-training
cs.CL 2026-04 unverdicted novelty 6.0

Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG 2026-03 unverdicted novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Communication-Efficient Gluon in Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.