Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Erfan Miahi; Eugene Belilovsky

arxiv: 2602.03839 · v2 · pith:22A2GXQDnew · submitted 2026-02-03 · 💻 cs.LG

Pith reviewed 2026-05-21 13:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords: weight update sparsity, distributed reinforcement learning, communication efficiency, BF16 precision, Adam optimizer, PULSESync, PULSELoCo, DiLoCo

The pith

Most per-step Adam updates in BF16 RL training fall below the rounding threshold and never affect the forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that at typical learning rates used for reinforcement learning post-training of large models, roughly 99 percent of Adam weight updates are smaller than the local rounding error introduced by BF16 casting. Because these tiny changes produce no difference in the next forward pass, they can be skipped entirely when moving weights or pseudo-gradients between machines. The authors formalize this as compute-visible sparsification and implement it in two algorithms: PULSESync for lossless sparse BF16 weight patches sent from trainers to inference workers, and PULSELoCo for sparsified DiLoCo-style exchanges among trainers with error feedback. If the observation holds, distributed RL on commodity networks becomes feasible at scales previously limited by bandwidth, while preserving bit-identical weights and unchanged training dynamics.
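
The absorption effect described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: `to_bf16` is a hypothetical helper that emulates a BF16 cast by rounding the FP32 bit pattern to its top 16 bits (round-to-nearest-even), and the weight value 1.0 and step size 3 × 10⁻⁶ are taken from the figure captions further down this page.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest BF16 value (round-to-nearest-even).

    Emulated by keeping the top 16 bits of the FP32 bit pattern.
    NaN/inf handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = 1.0
small_step = 3e-6   # order of magnitude of one Adam step at lr = 3e-6
large_step = 1e-2   # a step well above the local BF16 spacing

print(to_bf16(w - small_step) == to_bf16(w))  # True: the update is absorbed
print(to_bf16(w - large_step) == to_bf16(w))  # False: the update survives
```

Near 1.0 the BF16 grid spacing is 2⁻⁸ below and 2⁻⁷ above, so a step of a few millionths rounds back to the original value, which is the sense in which such an update is invisible to the next forward pass.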

Core claim

At the learning rates standard in RL post-training, Adam updates frequently lie below the BF16 rounding threshold, rendering approximately 99 percent of them invisible to subsequent forward passes. PULSESync exploits this by transmitting only the sparse BF16 patches that would alter the next computation, achieving over 100x reduction in trainer-to-inference communication while reconstructing trainer weights bit-identically at the inference workers. PULSELoCo applies the same principle with error feedback to pseudo-gradient synchronization, matching DiLoCo performance on four models while cutting trainer-to-trainer traffic by more than 17x versus DiLoCo and over 100x versus DDP in the largest evaluated setting.

What carries the argument

Compute-visible sparsification: transmit only those weight or pseudo-gradient updates that would change the result of the next forward pass after BF16 casting.
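
One way to read this rule in code is as a mask over parameters whose BF16 cast differs before and after an optimizer step. The sketch below is an illustration under assumptions, not the paper's implementation: `bf16_bits` and `compute_visible_mask` are hypothetical names, the BF16 cast is emulated in NumPy with round-to-nearest-even, and the synthetic weights and step sizes are made up for the demo.

```python
import numpy as np

def bf16_bits(x: np.ndarray) -> np.ndarray:
    """16-bit BF16 pattern of each FP32 value (round-to-nearest-even)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) >> np.uint32(16)).astype(np.uint16)

def compute_visible_mask(master_old: np.ndarray, master_new: np.ndarray) -> np.ndarray:
    """True where the BF16 cast of the weight changed, i.e. where the
    next BF16 forward pass would see a different value."""
    return bf16_bits(master_old) != bf16_bits(master_new)

# Synthetic demo: small-magnitude weights and sign-like steps of size 3e-6.
rng = np.random.default_rng(0)
old = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)
step = (3e-6 * np.sign(rng.normal(size=old.size))).astype(np.float32)
new = old - step

mask = compute_visible_mask(old, new)
print(f"fraction of compute-visible updates: {mask.mean():.4f}")
```

The fraction printed depends entirely on the synthetic weight distribution chosen here, so it should not be compared with the sparsity figures reported in the paper.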

If this is right

PULSESync delivers over 100x lower weight-synchronization volume while guaranteeing bit-identical weights at inference workers.
PULSELoCo matches full DiLoCo performance on multiple models with over 17x less trainer-to-trainer communication.
The same sparsification principle can be applied to both weight synchronization and pseudo-gradient exchanges in bandwidth-constrained settings.
Communication costs drop by more than 100x versus standard DDP in the largest evaluated configurations without loss of training fidelity.
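
The error-feedback idea mentioned for PULSELoCo can be sketched generically: whatever is not transmitted in one round is carried forward and added to the next round's update, so small contributions are delayed rather than lost. The code below is a sketch under stated assumptions. The gating rule (send an entry once the accumulated update would change the BF16 cast of the weight) is a guess made for illustration, and `sparsify_with_error_feedback` is a hypothetical name; the paper's exact rule is not given in this summary.

```python
import numpy as np

def bf16_bits(x: np.ndarray) -> np.ndarray:
    """16-bit BF16 pattern of each FP32 value (round-to-nearest-even)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) >> np.uint32(16)).astype(np.uint16)

def sparsify_with_error_feedback(weights, delta, residual):
    """Return (indices, values, new_residual) for one synchronization round.

    Assumed gate: an entry is sent only if adding the accumulated update
    would change the BF16 cast of the corresponding weight.
    """
    acc = residual + delta                      # fold in what was held back
    visible = bf16_bits(weights + acc) != bf16_bits(weights)
    idx = np.flatnonzero(visible)
    vals = acc[idx].copy()
    new_residual = acc.copy()
    new_residual[idx] = 0.0                     # sent entries leave no residual
    return idx, vals, new_residual

weights = np.array([1.0, 1.0], dtype=np.float32)
delta = np.array([0.0015, 0.01], dtype=np.float32)
residual = np.zeros_like(weights)

sent_per_round = []
for _ in range(3):
    idx, vals, residual = sparsify_with_error_feedback(weights, delta, residual)
    sent_per_round.append(idx.tolist())
print(sent_per_round)
```

In this toy run the weights are held fixed so that only the residual logic is exercised: the second entry is sent every round, while the first is held back until its accumulated value is large enough to cross the BF16 rounding boundary.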

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rounding-threshold argument could be tested in other low-precision formats or optimizers to see whether comparable sparsity appears outside RL post-training.
If the sparsity pattern proves stable, the technique might extend naturally to inference-only serving clusters that also run occasional fine-tuning steps.
Hardware that natively supports sparse BF16 updates could further amplify the observed bandwidth savings.

Load-bearing premise

The observed 99 percent sparsity below the BF16 rounding threshold stays consistent across training steps and can be exploited without changing overall training dynamics.

What would settle it

A step-by-step measurement across an entire training run that shows either that far fewer than 99 percent of updates fall below the BF16 threshold, or that applying the sparsification produces measurable differences in final model quality or convergence speed.

Figures

Figures reproduced from arXiv: 2602.03839 by Erfan Miahi, Eugene Belilovsky.

**Figure 1.** Compute utilization vs. network bandwidth for a 7B model with 50 s/step compute time. Full weight synchronization (14 GB) requires 20 Gbit/s links for 90% GPU utilization. PULSE reduces this to 0.2 Gbit/s (a 100× reduction) by transmitting only the 1% of parameters that change (140 MB). This enables efficient training over standard network connections, where the global median fixed broadband speed is low. W…

**Figure 2.** Weight update sparsity across model scales and families. (a) Mean per-step sparsity (%) averaged over 400 training steps. Error bars indicate ±1 standard deviation across steps. (b) Sparsity when comparing θ_t to θ_{t+k} for increasing k. Shaded regions indicate ±1 standard deviation. Within the recommended k ≤ 8 range for asynchronous RL [29], sparsity remains above 98% for all models. minimum sparsity (worst…

**Figure 3.** Why most weights cannot be updated in BF16. The diagonal line shows the minimum update size needed to change a weight (larger weights require larger updates). Horizontal lines show Adam update bounds at learning rate 3 × 10⁻⁶: the effective bound (η) and the absorption bound (10η). The shaded region marks weights beyond the absorption bound, which are permanently frozen. Gray dots show that most LLM weigh…

**Figure 4.** Figure 4b shows how both policy staleness and measurement granularity affect sparsity. For per-step updates (k = 1), sparsity remains above 98.5% even at 32-step staleness. When measuring over longer intervals (higher k), sparsity decreases as more parameters accumulate changes, but remains above 97.5% across all conditions tested. The shaded error bands show that variance increases modestly with both staleness an…

**Figure 5.** PULSE synchronization topology. Training nodes use high-bandwidth interconnects for dense gradient communication. Sparse weight patches are published to a central relay, enabling inference nodes to synchronize over commodity networks. using narrower integer types (e.g., uint16 instead of int32). Together, these contribute ~23% additional compression beyond the sparse representation alone (Section E.4). Dec…

**Figure 6.** Training progress with PULSE. Validation pass@1 (colored lines) improves steadily while upload sizes (gray lines) remain stable throughout training. Each training window corresponds to approximately 6 minutes, during which up to 8 gradient steps may occur; the exact count varies due to the asynchronous nature of the system. The dashed line indicates the mean upload size of 108 MB, representing more than 10…

**Figure 7.** Update absorption in BF16 arithmetic. BF16 can represent only 128 distinct values between consecutive powers of two (shown as tick marks). Updates falling in the red zone round back to the original value and are absorbed; updates in the green zone survive. With learning rate 3 × 10⁻⁶ and weight w = 1.0, a typical update ∆w ≈ 3 × 10⁻⁶ falls deep in the absorption zone (ratio 3 × 10⁻⁶ ≪ 1/256). In mixed-prec…

**Figure 8.** Sparsity with mixed-precision training (FP32 master weights, BF16 computation). Training Qwen2.5-1.5B-Instruct with GRPO on MATH tasks. Validation pass@1 improves steadily while weight update sparsity (measured by casting FP32 master weights to BF16 and comparing consecutive steps) remains consistently above 99.4%. Shaded regions indicate ±1 standard deviation across 4 seeds. where p_i, q_i ≥ 0 are the weig…

**Figure 9.** Ratio |m̂_t|/√v̂_t for an adversarial gradient sequence. The sequence consists of 105 near-zero gradients followed by constant gradients of magnitude 1. The ratio peaks at 6.57 after 12 large gradients, then decays as v_t catches up. Despite this extreme construction, the ratio only reaches 66% of the theoretical bound of 10. For constant gradients (typical case), the ratio equals 1. Connection to observed…

**Figure 10.** Sparse value patching. A patch P = (I, V) consists of changed indices I and their new values V. To reconstruct W_t from W_{t−1}, we overwrite: W_t[I] ← V. This direct assignment requires no floating-point arithmetic, guaranteeing bit-exact reconstruction. B.1 System Overview grail separates computationally expensive inference (rollout generation) from training, enabling distributed nodes to contribute compute …

**Figure 11.** Gradient sparsity throughout training for standard GRPO across (a) model architectures and sizes, (b) iteration counts, and (c) learning rates. Sparsity is measured as the fraction of exactly-zero gradient values. Shaded regions indicate standard error across 4 seeds. Gradient sparsity remains near zero (<1%) throughout training regardless of model, iteration count, or learning rate, demonstrating that s…

**Figure 12.** Training curves across model scales. Pass@1 validation accuracy throughout training for all models used in our sparsity analysis. All models show rapid initial improvement followed by convergence within 400 steps, validating our choice of training duration. Shaded regions indicate ±1 standard error across 4 seeds. D.2 Training Curves Across Model Scales To validate that our 400-step training duration capt…

**Figure 13.** Bandwidth-aware algorithm selection. Total transfer time (encode + network + decode) for a 7B model. Shaded regions indicate the optimal algorithm per bandwidth tier. Fast algorithms like lz4 are preferred at high bandwidth, while high-ratio algorithms like zstd-3 are better for constrained links. Crossover formula. The crossover bandwidth where two algorithms A and B have equal total transfer time can be…

**Figure 14.** Checkpoint chain structure. Full checkpoints (anchors) are published every k steps; between anchors, only sparse patches are transmitted. This structure enables the fast path (single patch application) for steady-state nodes while providing recovery points for late joiners via the slow path (anchor download plus patch chain). See Algorithm 4 for the formal protocol. so the anchor interval only affects col…
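
The Figure 13 caption defines total transfer time as encode plus network plus decode time, and mentions a crossover bandwidth at which two compression algorithms tie. That crossover follows directly from the definition by setting the two totals equal. The sketch below derives it from that definition only; it is not necessarily the formula printed in the paper, and the CPU times and payload sizes are hypothetical placeholders rather than measurements of lz4 or zstd.

```python
def total_time(cpu_s: float, payload_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Encode + decode CPU time plus time on the wire."""
    return cpu_s + payload_bytes / bandwidth_bytes_per_s

def crossover_bandwidth(cpu_a, size_a, cpu_b, size_b):
    """Bandwidth (bytes/s) at which algorithms A and B take equal total time.

    Solves cpu_a + size_a / B == cpu_b + size_b / B for B.
    Returns None when no positive crossover exists (one algorithm dominates).
    """
    if cpu_a == cpu_b or size_a == size_b:
        return None
    bw = (size_a - size_b) / (cpu_b - cpu_a)
    return bw if bw > 0 else None

# Hypothetical numbers: a fast, low-ratio codec vs. a slow, high-ratio codec.
FAST = (0.5, 100e6)    # (encode+decode seconds, compressed bytes)
SMALL = (2.0, 80e6)

bw = crossover_bandwidth(*FAST, *SMALL)
print(f"crossover at {bw * 8 / 1e6:.1f} Mbit/s")
```

Above the crossover the faster codec wins because CPU time dominates; below it the smaller payload wins because wire time dominates, which matches the qualitative statement in the caption.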

Original abstract

Bandwidth-constrained distributed reinforcement learning (RL) post-training of large language models is bottlenecked by two channels: weight synchronization from trainers to inference workers, and gradient or pseudo-gradient synchronization across trainers. We find that approximately 99% of per-step weight updates are invisible after the BF16 cast used by standard training and inference forward passes. We explain this sparsity by showing that, at typical RL post-training learning rates, Adam updates often fall below the local BF16 rounding threshold. We turn this observation into an algorithmic principle called compute-visible sparsification: transmit only updates that would change the next forward pass. PULSE (Precision-gated Updates for Low-precision Sparse Exchange) turns this principle into two communication algorithms: PULSESync sends lossless sparse BF16 weight patches from trainers to inference workers, and PULSELoCo sparsifies DiLoCo-style FP32 pseudo-gradient synchronization with error feedback. Over bandwidth-constrained commodity networks, PULSESync cuts weight-synchronization communication by over 100x while reconstructing trainer weights bit-identically. PULSELoCo matches DiLoCo across four models while reducing trainer-to-trainer communication by over 17x versus DiLoCo and over 100x versus DDP in the largest evaluated setting.
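
The lossless patch mechanism can be illustrated with the description in the Figure 10 caption: a patch is a pair of changed indices and their new values, applied by direct overwrite. The following is a minimal sketch under assumptions, with BF16 weights represented as raw `uint16` bit patterns in NumPy; `make_patch` and `apply_patch` are hypothetical names, not functions from the authors' code.

```python
import numpy as np

def make_patch(prev_bits: np.ndarray, curr_bits: np.ndarray):
    """Build a patch (indices, values) of every element whose bits changed."""
    idx = np.flatnonzero(prev_bits != curr_bits)
    return idx, curr_bits[idx]

def apply_patch(prev_bits: np.ndarray, patch) -> np.ndarray:
    """Reconstruct the new weights by overwriting: W_t[I] <- V."""
    idx, vals = patch
    out = prev_bits.copy()
    out[idx] = vals
    return out

# Synthetic demo: change roughly 1% of 100k BF16 bit patterns.
rng = np.random.default_rng(0)
prev = rng.integers(0, 2**16, size=100_000, dtype=np.uint16)
curr = prev.copy()
changed = rng.choice(prev.size, size=1_000, replace=False)
curr[changed] ^= np.uint16(1)          # flip the lowest mantissa bit

patch = make_patch(prev, curr)
restored = apply_patch(prev, patch)
print(np.array_equal(restored, curr), patch[0].size)
```

Because the receiver only copies bit patterns and performs no floating-point arithmetic, the reconstruction is exact by construction, which is the property the abstract calls bit-identical.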

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that most Adam updates fall below the BF16 rounding threshold at normal learning rates, letting you skip them for bit-identical weight sync and big communication cuts in distributed RL.

The main thing to know is that this work turns a property of BF16 arithmetic into a practical communication trick. At the learning rates used in RL post-training, Adam updates are often smaller than the local rounding threshold, so about 99 percent of them do not change the weights that actually get used in the next forward pass. The authors call this compute-visible sparsification and build two algorithms around it: PULSESync for trainer-to-inference weight patches and PULSELoCo for sparsifying DiLoCo-style pseudo-gradient exchanges with error feedback. PULSESync claims over 100x lower communication while keeping exact bit-identical weights, and PULSELoCo matches DiLoCo performance with over 17x less trainer-to-trainer traffic in the largest setting they test.

The explanation tying update magnitude to BF16 invisibility is clear and directly usable. The reported gains on commodity networks are the kind of concrete number that matters for scaling.

The soft spot is the stability of that sparsity level. The abstract states the 99 percent figure at typical rates, but if the fraction of visible updates rises as training progresses or at different scales, the net savings after metadata overhead would be smaller than advertised. More detail on how sparsity evolves over full runs and on the exact cost of indexing the sparse patches would strengthen the central claim.

This is for people running distributed RL post-training on bandwidth-limited setups. A reader who needs to cut synchronization costs without changing model behavior will find the approach worth trying. It deserves a serious referee because the underlying observation is reproducible from the arithmetic and the potential infrastructure impact is real, even if the long-run measurements need tighter validation.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes weight update sparsity in BF16 for distributed RL post-training of LLMs. It reports that approximately 99% of per-step Adam updates fall below the local BF16 rounding threshold at typical learning rates, making them invisible to forward passes. This leads to the principle of compute-visible sparsification, implemented in PULSESync (lossless sparse BF16 weight patches from trainers to inference workers, claiming >100x communication reduction with bit-identical reconstruction) and PULSELoCo (sparsified DiLoCo-style FP32 pseudo-gradient sync with error feedback, matching DiLoCo performance while cutting trainer-to-trainer communication by >17x vs DiLoCo and >100x vs DDP).

Significance. If the ~99% sparsity level holds stably across full training runs and generalizes, the work offers a practical route to alleviate communication bottlenecks in bandwidth-constrained distributed RL by exploiting low-precision arithmetic properties without changing training dynamics. The bit-identical reconstruction guarantee in PULSESync and the reported performance parity with DiLoCo on four models are concrete strengths that could enable larger-scale post-training on commodity networks.
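
To make the bandwidth stakes concrete, the numbers quoted in the Figure 1 caption (14 GB full sync, 140 MB sparse sync, 50 s of compute per step, 90% target utilization) can be reproduced approximately with simple arithmetic. The sketch below assumes a model that the page does not state explicitly: synchronization is not overlapped with compute, utilization is compute time divided by compute plus transfer time, and 1 GB is 10⁹ bytes.

```python
def required_bandwidth_gbit_s(payload_bytes: float, compute_s: float,
                              target_utilization: float) -> float:
    """Link speed needed so that a non-overlapped transfer still leaves
    the GPU busy for `target_utilization` of each step.

    Assumed model: utilization = compute / (compute + transfer).
    """
    transfer_budget_s = compute_s * (1.0 / target_utilization - 1.0)
    return payload_bytes * 8 / transfer_budget_s / 1e9

dense = required_bandwidth_gbit_s(14e9, compute_s=50, target_utilization=0.9)
sparse = required_bandwidth_gbit_s(140e6, compute_s=50, target_utilization=0.9)
print(f"dense sync:  {dense:.2f} Gbit/s")   # about 20 Gbit/s
print(f"sparse sync: {sparse:.2f} Gbit/s")  # about 0.2 Gbit/s
```

Under these assumptions the transfer budget is about 5.6 s per step, which gives roughly 20 Gbit/s for the dense payload and roughly 0.2 Gbit/s for the sparse one, consistent with the caption.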

major comments (1)

The central >100x communication reduction for PULSESync (and the overall algorithmic principle) rests on the assumption that per-step sparsity near 99% remains stable for the entire post-training trajectory. The abstract states results at typical RL post-training learning rates but provides no data or analysis on how sparsity evolves as gradients or effective step sizes change over time; a sustained drop even to 90% average would make the effective volume (including index/patch metadata) fall well short of 100x versus dense BF16 synchronization.

minor comments (2)

The abstract reports clear empirical outcomes but does not include error bars, number of runs, or ablation studies on sparsity stability, which would be needed to support the generalization claims.
Provide a precise definition and computation of the 'local BF16 rounding threshold' for each weight element, including any dependence on the current weight value or exponent.
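
The major comment's concern about metadata overhead can be quantified with a back-of-the-envelope ratio of dense to sparse payload size. This is a sketch under assumptions: the per-entry index cost is a free parameter (4 bytes for a raw 32-bit index, 0 bytes as an idealized lower bound), values are 2-byte BF16, and any downstream compression is ignored. None of these byte counts are taken from the paper.

```python
def sync_reduction(sparsity: float, value_bytes: int = 2, index_bytes: int = 4) -> float:
    """Dense-to-sparse payload ratio for a given fraction of unchanged weights.

    dense  = N * value_bytes
    sparse = N * (1 - sparsity) * (value_bytes + index_bytes)
    """
    density = 1.0 - sparsity
    if density <= 0.0:
        raise ValueError("sparsity must be below 1.0")
    return value_bytes / (density * (value_bytes + index_bytes))

for s in (0.99, 0.98, 0.90):
    ideal = sync_reduction(s, index_bytes=0)   # values only, no index cost
    naive = sync_reduction(s, index_bytes=4)   # raw 32-bit index per entry
    print(f"sparsity {s:.2f}: ideal {ideal:6.1f}x, with raw indices {naive:5.1f}x")
```

The ratio scales as one over the density, so a drop from 99% to 90% sparsity cuts the idealized reduction by a factor of ten, and any per-entry index cost lowers it further. This is why the referee asks for sparsity trajectories and index costs to be reported together.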

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's potential impact and for the constructive comment on sparsity stability. We address the major comment below and will incorporate revisions as noted.

Referee: The central >100x communication reduction for PULSESync (and the overall algorithmic principle) rests on the assumption that per-step sparsity near 99% remains stable for the entire post-training trajectory. The abstract states results at typical RL post-training learning rates but provides no data or analysis on how sparsity evolves as gradients or effective step sizes change over time; a sustained drop even to 90% average would make the effective volume (including index/patch metadata) fall well short of 100x versus dense BF16 synchronization.

Authors: We agree that stability of the ~99% sparsity level across the full post-training trajectory is essential to substantiate the communication reduction claims. The initial manuscript reports the sparsity figure from evaluations at representative RL post-training learning rates but does not include explicit trajectory analysis. To address this directly, we have performed additional experiments on the evaluated models showing that per-step sparsity remains above 98% throughout training; it does not decrease and often increases modestly as effective step sizes diminish. We will add a new figure and accompanying analysis of sparsity evolution (including metadata overhead) to the revised manuscript to make this evidence explicit.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external BF16 arithmetic observation

The paper begins from an empirical finding that ~99% of Adam updates fall below the BF16 rounding threshold at typical RL post-training rates, explains this via the interaction of learning rates and local precision, and defines compute-visible sparsification as transmitting only updates that would alter the next forward pass. This chain does not reduce any prediction or central result to a fitted parameter, self-definition, or self-citation load-bearing premise. PULSESync and PULSELoCo are direct algorithmic translations of the observed property rather than statistical fits or renamed inputs. The stability of sparsity across full trajectories is an unverified assumption that affects correctness but does not create circularity in the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests primarily on the domain property of BF16 rounding and the empirical sparsity observation at typical learning rates; no free parameters, new axioms, or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: At typical RL post-training learning rates, Adam updates often fall below the local BF16 rounding threshold and become invisible after casting. This is the load-bearing observation stated directly in the abstract as the basis for the sparsification rule.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
cs.LG 2026-05 unverdicted novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
cs.LG 2026-05 unverdicted novelty 4.0

SparseRL-Sync achieves lossless weight synchronization in large-scale RL by sending only changed parameters, reducing communication volume by roughly 100x under observed 99%+ element-level sparsity.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 2 Pith papers · 19 internal anchors

[1] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[2] Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
[3] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
[4] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
[5] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Mach...
[6] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[7] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[8] Saksham Sahai Srivastava and Vaneet Aggarwal. A technical survey of reinforcement learning techniques for large language models. arXiv preprint arXiv:2507.04136, 2025.
[9] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[10] Team OLMo et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025.
[11] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[12] Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.
[13] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Twentieth European Conference on Computer Systems (EuroSys '25), 2025. doi: 10.1145/3689031.3696075.
[14] Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, and Oleksii Kuchaiev. NeMo-Aligner: Scalable toolkit for efficient model alignment. In Conference on Language Modeling (COLM), 2024.
[15] Prime Intellect Team, Sami Jaghouar, Justus Mattern, Jack Min Ong, Jannik Straube, Manveer Basra, Aaron Pazdera, Kushal Thaman, Matthew Di Ferrante, Felix Gabriel, Fares Obeid, Kemal Erdem, Michael Keiblinger, and Johannes Hagemann. INTELLECT-2: A reasoning model trained through globally decentralized reinforcement learning, 2025.
[16] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, volume 30, 2017.
[17] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
[18] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, volume 32, 2019.
[19] Amir Sarfi, Benjamin Thérien, Joel Lidin, and Eugene Belilovsky. Communication efficient LLM pre-training with SparseLoCo. arXiv preprint arXiv:2508.15706, 2025.
[20] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
[21] Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tür, and Hao Peng. Reinforcement learning finetunes small subnetworks in large language models. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025.
[22] Qiying Yu et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[24] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
[25] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[26] Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[27] Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
[28] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021.
[29] Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for LLMs. arXiv preprint arXiv:2510.13786, 2025.
[30] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL's razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025.
[31]

Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai

Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai. The path not taken: RLVR provably learns off the principals.arXiv preprint arXiv:2511.08567, 2025

work page arXiv 2025
[32]

Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. InInternational Conference on Learning Representations, 2018

work page 2018
[33]

grail-v0: How we built a fully open, incentivized, decentralized reinforce- ment learning system

Erfan Miahi. grail-v0: How we built a fully open, incentivized, decentralized reinforce- ment learning system. Templar Research Blog, 2025. URL https://templarresearch. substack.com/p/grail-v0-how-we-built-a-fully-open. 13

work page 2025
[34]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[37] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, et al. 2 OLMo 2 furious. arXiv preprint arXiv:2501.00656, 2025.

[38] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.

A Theoretical Analysis

A.1 Formal Sparsity Definitions

We provide formal definitions of the sparsity metrics used thr...

Training process: Executes a tight loop that samples batches from the replay buffer and performs gradient updates. This process never waits for I/O, enabling multiple updates per window.

Upload process: Handles checkpoint serialization and upload asynchronously. When the trainer produces a new checkpoint, it is handed off to this process without blocking.

Download process: Fetches verified rollouts from storage at window boundaries and adds them to the replay buffer with staleness metadata.

Replay buffer. The replay buffer decouples data arrival from training consumption. It stores rollouts from multiple windows, supports staleness-weighted sampling (preferring fresher data), and implements automatic evicti...

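As a rough illustration of the replay buffer described above, the following Python sketch stores rollouts tagged with the window that produced them and samples with weights that favor fresher data. The class name, the exponential `decay` weighting, and the age-based `max_staleness` eviction rule are all assumptions made for this sketch; the text does not specify the weighting function or the eviction policy.

```python
import random
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class Rollout:
    payload: Any
    window: int  # window in which the rollout was produced (staleness metadata)


class ReplayBuffer:
    """Hypothetical buffer: exponential staleness weights, age-based eviction."""

    def __init__(self, max_staleness: int = 4, decay: float = 0.5, seed: int = 0):
        self.max_staleness = max_staleness  # assumed eviction threshold, in windows
        self.decay = decay                  # assumed per-window weight decay
        self._items: List[Rollout] = []
        self._rng = random.Random(seed)

    def add(self, payloads: Iterable[Any], window: int) -> None:
        """Called by the download side at a window boundary."""
        self._items.extend(Rollout(p, window) for p in payloads)

    def evict(self, current_window: int) -> None:
        """Drop rollouts older than max_staleness windows (assumed policy)."""
        self._items = [
            r for r in self._items
            if current_window - r.window <= self.max_staleness
        ]

    def sample(self, batch_size: int, current_window: int) -> List[Rollout]:
        """Sample with replacement, weighting each rollout by decay ** staleness."""
        self.evict(current_window)
        if not self._items:
            return []
        weights = [self.decay ** (current_window - r.window) for r in self._items]
        return self._rng.choices(self._items, weights=weights, k=batch_size)
```

Because `sample` never touches storage, a training loop built on it can keep drawing batches while downloads and uploads proceed elsewhere, which is the decoupling the paragraph describes.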
1. Sort indices in ascending order
2. Store first index as-is (4 bytes)
3. Store subsequent indices as differences from previous
4. Use variable-length encoding for differences

This typically reduces index storage by 40–60% before zstd compression.

E.3 Memory Management

The PULSE method requires maintaining the previous checkpoint to compute the sparse delta. The memory overhead is minimal:

• Training node: Maintains the current weights on the GPU and the previous weights in pinned CP...
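
The four index-encoding steps listed before E.3 (sort, store the first index in 4 bytes, store differences, variable-length encode the differences) can be sketched in Python as follows. The function names, the little-endian byte order, and the LEB128-style varint are assumptions for illustration; the text only says "variable-length encoding" and does not name a specific scheme. The zstd stage is omitted.

```python
from typing import Iterable, List


def encode_varint(value: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set if more bytes follow."""
    if value < 0:
        raise ValueError("varint expects a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_indices(indices: Iterable[int]) -> bytes:
    """Delta-encode indices; the first index must fit in 4 bytes."""
    ordered = sorted(indices)                          # step 1: ascending order
    if not ordered:
        return b""
    out = bytearray(ordered[0].to_bytes(4, "little"))  # step 2: first index, 4 bytes
    prev = ordered[0]
    for idx in ordered[1:]:
        out += encode_varint(idx - prev)               # steps 3 and 4: varint of the gap
        prev = idx
    return bytes(out)


def decode_indices(blob: bytes) -> List[int]:
    """Inverse of encode_indices: rebuild absolute indices from the gaps."""
    if not blob:
        return []
    result = [int.from_bytes(blob[:4], "little")]
    value, shift = 0, 0
    for byte in blob[4:]:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            result.append(result[-1] + value)
            value, shift = 0, 0
    return result
```

With the four indices 1000, 3, 70000, and 1001, this encoding takes 10 bytes (4 for the first index, then 2, 1, and 3 bytes for the gaps 997, 1, and 68999) instead of 16 bytes at a fixed 4 bytes per index. Savings on real sparse patches depend on how the changed positions are distributed.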
