
arxiv: 2602.03839 · v2 · pith:22A2GXQDnew · submitted 2026-02-03 · 💻 cs.LG

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Pith reviewed 2026-05-21 13:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords weight update sparsity · distributed reinforcement learning · communication efficiency · BF16 precision · Adam optimizer · PULSESync · PULSELoCo · DiLoCo

The pith

Most per-step Adam updates in BF16 RL training fall below the rounding threshold and never affect the forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that at typical learning rates used for reinforcement learning post-training of large models, roughly 99 percent of Adam weight updates are smaller than the local rounding error introduced by BF16 casting. Because these tiny changes produce no difference in the next forward pass, they can be skipped entirely when moving weights or pseudo-gradients between machines. The authors formalize this as compute-visible sparsification and implement it in two algorithms: PULSESync for lossless sparse BF16 weight patches sent from trainers to inference workers, and PULSELoCo for sparsified DiLoCo-style exchanges among trainers with error feedback. If the observation holds, distributed RL on commodity networks becomes feasible at scales previously limited by bandwidth, while preserving bit-identical weights and unchanged training dynamics.

Core claim

At the learning rates standard in RL post-training, Adam updates frequently lie below the BF16 rounding threshold, rendering approximately 99 percent of them invisible to subsequent forward passes. PULSESync exploits this by transmitting only the sparse BF16 patches that would alter the next computation, achieving over 100x reduction in trainer-to-inference communication while reconstructing trainer weights bit-identically at the inference workers. PULSELoCo applies the same principle with error feedback to pseudo-gradient synchronization, matching DiLoCo performance on four models while cutting trainer-to-trainer traffic by more than 17x versus DiLoCo and over 100x versus DDP in the largest evaluated setting.

What carries the argument

Compute-visible sparsification: transmit only those weight or pseudo-gradient updates that would change the result of the next forward pass after BF16 casting.
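
As a rough illustration of this rule, the sketch below marks an update as visible only when the BF16 bit pattern of the weight changes. It is not the paper's implementation: the helper names `to_bf16_bits` and `visible_indices` are hypothetical, BF16 is emulated by round-to-nearest-even truncation of float32 bits, and NaN or overflow handling is left out.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round a float to BF16 (nearest-even) and return its 16-bit pattern.

    Hypothetical helper: emulates the cast on float32 bits, ignores NaN/inf.
    """
    # Reinterpret the value as IEEE-754 float32 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BF16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def visible_indices(old_weights, new_weights):
    """Indices whose BF16 bit pattern differs between old and new weights."""
    return [
        i
        for i, (old, new) in enumerate(zip(old_weights, new_weights))
        if to_bf16_bits(old) != to_bf16_bits(new)
    ]

# Illustrative weights and a step of the same order as a 3e-6 learning rate.
old = [1.0, 0.5, 1e-4, -2.0]
step = 3e-6
new = [w - step for w in old]
changed = visible_indices(old, new)
```

With these illustrative values only the small weight (`1e-4`) changes its BF16 representation; the updates to the larger weights round back to the previous value and would not need to be sent.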

If this is right

  • PULSESync delivers over 100x lower weight-synchronization volume while guaranteeing bit-identical weights at inference workers.
  • PULSELoCo matches full DiLoCo performance on multiple models with over 17x less trainer-to-trainer communication.
  • The same sparsification principle can be applied to both weight synchronization and pseudo-gradient exchanges in bandwidth-constrained settings.
  • Communication costs drop by more than 100x versus standard DDP in the largest evaluated configurations without loss of training fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same rounding-threshold argument could be tested in other low-precision formats or optimizers to see whether comparable sparsity appears outside RL post-training.
  • If the sparsity pattern proves stable, the technique might extend naturally to inference-only serving clusters that also run occasional fine-tuning steps.
  • Hardware that natively supports sparse BF16 updates could further amplify the observed bandwidth savings.

Load-bearing premise

The observed 99 percent sparsity below the BF16 rounding threshold stays consistent across training steps and can be exploited without changing overall training dynamics.

What would settle it

A step-by-step measurement across an entire training run that shows either far fewer than 99 percent of updates fall below the BF16 threshold or that applying the sparsification produces measurable differences in final model quality or convergence speed.

Figures

Figures reproduced from arXiv: 2602.03839 by Erfan Miahi, Eugene Belilovsky.

Figure 1
Compute utilization vs. network bandwidth for a 7B model with 50 s/step compute time. Full weight synchronization (14 GB) requires 20 Gbit/s links for 90% GPU utilization. PULSE reduces this to 0.2 Gbit/s (a 100× reduction) by transmitting only the 1% of parameters that change (140 MB). This enables efficient training over standard network connections, where the global median fixed broadband speed is low. W…
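
The bandwidth figures in this caption can be reproduced with a short calculation. The sketch assumes that weight transfer and compute do not overlap, so utilization is compute time divided by compute plus transfer time; that model and the function name `required_gbit_per_s` are assumptions made here, not details taken from the paper.

```python
def required_gbit_per_s(payload_bytes: float, compute_s: float, utilization: float) -> float:
    """Link speed needed to keep GPU utilization at the target level.

    Assumes no overlap: utilization = compute / (compute + transfer).
    """
    max_transfer_s = compute_s * (1.0 - utilization) / utilization
    return payload_bytes * 8 / max_transfer_s / 1e9

# Values from the caption: 7B model, 50 s/step, 90% target utilization.
dense = required_gbit_per_s(14e9, 50.0, 0.90)    # full 14 GB sync
sparse = required_gbit_per_s(140e6, 50.0, 0.90)  # 140 MB sparse patch
```

Under this model the dense case needs roughly 20 Gbit/s and the sparse case roughly 0.2 Gbit/s, consistent with the caption.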
Figure 2
Weight update sparsity across model scales and families. (a) Mean per-step sparsity (%) averaged over 400 training steps. Error bars indicate ±1 standard deviation across steps. (b) Sparsity when comparing θ_t to θ_{t+k} for increasing k. Shaded regions indicate ±1 standard deviation. Within the recommended k ≤ 8 range for asynchronous RL [29], sparsity remains above 98% for all models. minimum sparsity (worst…
Figure 3
Why most weights cannot be updated in BF16. The diagonal line shows the minimum update size needed to change a weight (larger weights require larger updates). Horizontal lines show Adam update bounds at learning rate $3 \times 10^{-6}$: the effective bound (η) and the absorption bound (10η). The shaded region marks weights beyond the absorption bound, which are permanently frozen. Gray dots show that most LLM weigh…
Figure 4
Figure 4b shows how both policy staleness and measurement granularity affect sparsity. For per-step updates (k = 1), sparsity remains above 98.5% even at 32-step staleness. When measuring over longer intervals (higher k), sparsity decreases as more parameters accumulate changes, but remains above 97.5% across all conditions tested. The shaded error bands show that variance increases modestly with both staleness an…
Figure 5
PULSE synchronization topology. Training nodes use high-bandwidth interconnects for dense gradient communication. Sparse weight patches are published to a central relay, enabling inference nodes to synchronize over commodity networks. using narrower integer types (e.g., uint16 instead of int32). Together, these contribute ∼23% additional compression beyond the sparse representation alone (Section E.4). Dec…
Figure 6
Training progress with PULSE. Validation pass@1 (colored lines) improves steadily while upload sizes (gray lines) remain stable throughout training. Each training window corresponds to approximately 6 minutes, during which up to 8 gradient steps may occur; the exact count varies due to the asynchronous nature of the system. The dashed line indicates the mean upload size of 108 MB, representing more than 10…
Figure 7
Update absorption in BF16 arithmetic. BF16 can represent only 128 distinct values between consecutive powers of two (shown as tick marks). Updates falling in the red zone round back to the original value and are absorbed; updates in the green zone survive. With learning rate $3 \times 10^{-6}$ and weight w = 1.0, a typical update ∆w ≈ $3 \times 10^{-6}$ falls deep in the absorption zone (ratio $3 \times 10^{-6} \ll 1/256$). In mixed-prec…
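
The absorption argument in this caption can be written out as a short calculation. This is a sketch under assumptions not stated in the caption: round-to-nearest BF16 with 7 explicit mantissa bits, a weight $w$ that is exactly representable in BF16, and the symbols $e$ and $\mathrm{ulp}$, which are introduced here only for illustration.

```latex
% Assumed notation: e = \lfloor \log_2 |w| \rfloor; ulp(w) = spacing of BF16 values in [2^e, 2^{e+1}).
\begin{align*}
|w| \in [2^{e},\, 2^{e+1})
  \;&\Rightarrow\; \mathrm{ulp}(w) = 2^{e-7}
  && \text{(128 values between powers of two)} \\
2^{e} < |w|,\quad |\Delta w| < \tfrac{1}{2}\,\mathrm{ulp}(w) = 2^{e-8}
  \;&\Rightarrow\; \mathrm{BF16}(w + \Delta w) = w
  && \text{(update absorbed)} \\
w = 1\ (e = 0):\quad
  \underbrace{2^{-8}}_{\text{away from zero}} = \tfrac{1}{256},\quad
  \underbrace{2^{-9}}_{\text{toward zero}} = \tfrac{1}{512}
  \;&\gg\; |\Delta w| \approx 3 \times 10^{-6}
\end{align*}
```

At an exact power of two the spacing toward zero is half as large, which is why the $w = 1$ line lists two thresholds; a $3 \times 10^{-6}$ update is two to three orders of magnitude below either one.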
Figure 8
Sparsity with mixed-precision training (FP32 master weights, BF16 computation). Training Qwen2.5-1.5B-Instruct with GRPO on MATH tasks. Validation pass@1 improves steadily while weight update sparsity (measured by casting FP32 master weights to BF16 and comparing consecutive steps) remains consistently above 99.4%. Shaded regions indicate ±1 standard deviation across 4 seeds. where p_i, q_i ≥ 0 are the weig…
Figure 9
Ratio $|\hat{m}_t| / \sqrt{\hat{v}_t}$ for an adversarial gradient sequence. The sequence consists of 105 near-zero gradients followed by constant gradients of magnitude 1. The ratio peaks at 6.57 after 12 large gradients, then decays as $v_t$ catches up. Despite this extreme construction, the ratio only reaches 66% of the theoretical bound of 10. For constant gradients (typical case), the ratio equals 1. Connection to observed…
Figure 10
Sparse value patching. A patch P = (I, V) consists of changed indices I and their new values V. To reconstruct W_t from W_{t−1}, we overwrite: W_t[I] ← V. This direct assignment requires no floating-point arithmetic, guaranteeing bit-exact reconstruction. B.1 System Overview grail separates computationally expensive inference (rollout generation) from training, enabling distributed nodes to contribute compute …
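
The patch construction and overwrite described in this caption can be sketched in a few lines. Weights are represented here as lists of raw 16-bit patterns; that representation and the function names `make_patch` and `apply_patch` are illustrative assumptions, not the paper's code.

```python
def make_patch(prev_bits, curr_bits):
    """Build P = (I, V): indices whose 16-bit pattern changed, plus the new values."""
    indices = [i for i, (a, b) in enumerate(zip(prev_bits, curr_bits)) if a != b]
    values = [curr_bits[i] for i in indices]
    return indices, values

def apply_patch(prev_bits, patch):
    """Rebuild the new weights by plain overwrite; no arithmetic is performed."""
    indices, values = patch
    out = list(prev_bits)  # copy so the previous checkpoint stays intact
    for i, v in zip(indices, values):
        out[i] = v
    return out

# Hypothetical BF16 bit patterns for two consecutive steps.
w_prev = [0x3F80, 0x3F00, 0x38D2, 0xC000]
w_curr = [0x3F80, 0x3F00, 0x38CB, 0xC000]
patch = make_patch(w_prev, w_curr)
w_rebuilt = apply_patch(w_prev, patch)
```

Because the receiver only copies stored bit patterns, the rebuilt list equals the sender's list exactly, which is the bit-exactness property the caption describes.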
Figure 11
Gradient sparsity throughout training for standard GRPO across (a) model architectures and sizes, (b) iteration counts, and (c) learning rates. Sparsity is measured as the fraction of exactly-zero gradient values. Shaded regions indicate standard error across 4 seeds. Gradient sparsity remains near zero (<1%) throughout training regardless of model, iteration count, or learning rate, demonstrating that s…
Figure 12
Training curves across model scales. Pass@1 validation accuracy throughout training for all models used in our sparsity analysis. All models show rapid initial improvement followed by convergence within 400 steps, validating our choice of training duration. Shaded regions indicate ±1 standard error across 4 seeds. D.2 Training Curves Across Model Scales To validate that our 400-step training duration capt…
Figure 13
Bandwidth-aware algorithm selection. Total transfer time (encode + network + decode) for a 7B model. Shaded regions indicate the optimal algorithm per bandwidth tier. Fast algorithms like lz4 are preferred at high bandwidth, while high-ratio algorithms like zstd-3 are better for constrained links. Crossover formula. The crossover bandwidth where two algorithms A and B have equal total transfer time can be…
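
The paper's crossover formula is cut off in this excerpt, so the sketch below is a derivation from the caption's definition of total time rather than a reproduction of it. It assumes network time equals compressed size divided by bandwidth, so each algorithm's total time is a fixed codec cost plus size over bandwidth. The function names and the sizes and timings are hypothetical.

```python
def total_time_s(size_bytes, codec_s, bandwidth_bytes_per_s):
    """Total transfer time: (encode + decode) seconds plus network time."""
    return codec_s + size_bytes / bandwidth_bytes_per_s

def crossover_bandwidth(size_a, codec_a, size_b, codec_b):
    """Bandwidth (bytes/s) where A and B take equal total time.

    Solves codec_a + size_a / bw = codec_b + size_b / bw for bw.
    Returns None when no positive crossover exists.
    """
    if codec_a == codec_b:
        return None  # cost curves never cross at a finite bandwidth
    bw = (size_a - size_b) / (codec_b - codec_a)
    return bw if bw > 0 else None

# Hypothetical numbers: A is fast but compresses less, B is slow but compresses more.
bw_star = crossover_bandwidth(200e6, 0.5, 120e6, 2.5)
```

With these made-up inputs the crossover sits at 40 MB/s: below it the higher-ratio algorithm wins, above it the faster one does, which matches the qualitative behavior the caption describes.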
Figure 14
Checkpoint chain structure. Full checkpoints (anchors) are published every k steps; between anchors, only sparse patches are transmitted. This structure enables the fast path (single patch application) for steady-state nodes while providing recovery points for late joiners via the slow path (anchor download plus patch chain). See Algorithm 4 for the formal protocol. so the anchor interval only affects col…
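
The fast and slow paths in this caption can be illustrated with a small planner. This is only a sketch and not the paper's Algorithm 4: it assumes anchors land on steps divisible by the anchor interval, treats any node that is not exactly one step behind as a late joiner, and uses a hypothetical function name `sync_plan`.

```python
def sync_plan(node_step, target_step, anchor_interval):
    """List the downloads a node needs in order to reach target_step.

    node_step is None for a node with no local weights.
    Assumption: anchors are published at multiples of anchor_interval.
    """
    if node_step is not None and node_step == target_step:
        return []  # already up to date
    if node_step is not None and target_step - node_step == 1:
        return [("patch", target_step)]  # fast path: one sparse patch
    # Slow path: most recent anchor, then every patch after it.
    anchor = (target_step // anchor_interval) * anchor_interval
    chain = [("patch", s) for s in range(anchor + 1, target_step + 1)]
    return [("anchor", anchor)] + chain

steady = sync_plan(41, 42, 10)
late = sync_plan(None, 43, 10)
```

A steady-state node fetches a single patch, while a late joiner at step 43 with interval 10 fetches the step-40 anchor followed by three patches.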
Original abstract

Bandwidth-constrained distributed reinforcement learning (RL) post-training of large language models is bottlenecked by two channels: weight synchronization from trainers to inference workers, and gradient or pseudo-gradient synchronization across trainers. We find that approximately 99% of per-step weight updates are invisible after the BF16 cast used by standard training and inference forward passes. We explain this sparsity by showing that, at typical RL post-training learning rates, Adam updates often fall below the local BF16 rounding threshold. We turn this observation into an algorithmic principle called compute-visible sparsification: transmit only updates that would change the next forward pass. PULSE (Precision-gated Updates for Low-precision Sparse Exchange) turns this principle into two communication algorithms: PULSESync sends lossless sparse BF16 weight patches from trainers to inference workers, and PULSELoCo sparsifies DiLoCo-style FP32 pseudo-gradient synchronization with error feedback. Over bandwidth-constrained commodity networks, PULSESync cuts weight-synchronization communication by over 100x while reconstructing trainer weights bit-identically. PULSELoCo matches DiLoCo across four models while reducing trainer-to-trainer communication by over 17x versus DiLoCo and over 100x versus DDP in the largest evaluated setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes weight update sparsity in BF16 for distributed RL post-training of LLMs. It reports that approximately 99% of per-step Adam updates fall below the local BF16 rounding threshold at typical learning rates, making them invisible to forward passes. This leads to the principle of compute-visible sparsification, implemented in PULSESync (lossless sparse BF16 weight patches from trainers to inference workers, claiming a >100x communication reduction with bit-identical reconstruction) and PULSELoCo (sparsified DiLoCo-style FP32 pseudo-gradient sync with error feedback, matching DiLoCo performance while cutting trainer-to-trainer communication by >17x vs DiLoCo and >100x vs DDP).

Significance. If the ~99% sparsity level holds stably across full training runs and generalizes, the work offers a practical route to alleviate communication bottlenecks in bandwidth-constrained distributed RL by exploiting low-precision arithmetic properties without changing training dynamics. The bit-identical reconstruction guarantee in PULSESync and the reported performance parity with DiLoCo on four models are concrete strengths that could enable larger-scale post-training on commodity networks.

major comments (1)
  1. The central >100x communication reduction for PULSESync (and the overall algorithmic principle) rests on the assumption that per-step sparsity near 99% remains stable for the entire post-training trajectory. The abstract states results at typical RL post-training learning rates but provides no data or analysis on how sparsity evolves as gradients or effective step sizes change over time; a sustained drop even to 90% average would make the effective volume (including index/patch metadata) fall well short of 100x versus dense BF16 synchronization.
minor comments (2)
  1. The abstract reports clear empirical outcomes but does not include error bars, number of runs, or ablation studies on sparsity stability, which would be needed to support the generalization claims.
  2. Provide a precise definition and computation of the 'local BF16 rounding threshold' for each weight element, including any dependence on the current weight value or exponent.
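
The sensitivity raised in the major comment can be made concrete with a back-of-the-envelope model. The sketch assumes 2 bytes per BF16 value and a fixed per-entry index cost before any delta coding or general-purpose compression; the function name and the default 4-byte index are assumptions chosen for illustration, not the paper's encoding.

```python
def reduction_factor(sparsity, value_bytes=2, index_bytes=4):
    """Dense BF16 bytes divided by sparse-patch bytes, per parameter.

    Assumes every changed entry costs value_bytes + index_bytes and that
    no further compression is applied.
    """
    changed_fraction = 1.0 - sparsity
    return value_bytes / (changed_fraction * (value_bytes + index_bytes))

# Index-free upper bounds: the reduction can never exceed 1 / (1 - sparsity).
bound_99 = reduction_factor(0.99, index_bytes=0)
bound_90 = reduction_factor(0.90, index_bytes=0)
# With naive, uncompressed 4-byte indices at 90% sparsity.
naive_90 = reduction_factor(0.90)
```

Even with free indices, 90% sparsity caps the reduction at about 10x, so reaching 100x requires sparsity of roughly 99% or higher together with compact index encoding.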

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's potential impact and for the constructive comment on sparsity stability. We address the major comment below and will incorporate revisions as noted.

Point-by-point responses
  1. Referee: The central >100x communication reduction for PULSESync (and the overall algorithmic principle) rests on the assumption that per-step sparsity near 99% remains stable for the entire post-training trajectory. The abstract states results at typical RL post-training learning rates but provides no data or analysis on how sparsity evolves as gradients or effective step sizes change over time; a sustained drop even to 90% average would make the effective volume (including index/patch metadata) fall well short of 100x versus dense BF16 synchronization.

    Authors: We agree that stability of the ~99% sparsity level across the full post-training trajectory is essential to substantiate the communication reduction claims. The initial manuscript reports the sparsity figure from evaluations at representative RL post-training learning rates but does not include explicit trajectory analysis. To address this directly, we have performed additional experiments on the evaluated models showing that per-step sparsity remains above 98% throughout training; it does not decrease and often increases modestly as effective step sizes diminish. We will add a new figure and accompanying analysis of sparsity evolution (including metadata overhead) to the revised manuscript to make this evidence explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external BF16 arithmetic observation

full rationale

The paper begins from an empirical finding that ~99% of Adam updates fall below the BF16 rounding threshold at typical RL post-training rates, explains this via the interaction of learning rates and local precision, and defines compute-visible sparsification as transmitting only updates that would alter the next forward pass. This chain does not reduce any prediction or central result to a fitted parameter, self-definition, or self-citation load-bearing premise. PULSESync and PULSELoCo are direct algorithmic translations of the observed property rather than statistical fits or renamed inputs. The stability of sparsity across full trajectories is an unverified assumption that affects correctness but does not create circularity in the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests primarily on the domain property of BF16 rounding and the empirical sparsity observation at typical learning rates; no free parameters, new axioms, or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: At typical RL post-training learning rates, Adam updates often fall below the local BF16 rounding threshold and become invisible after casting.
    This is the load-bearing observation stated directly in the abstract as the basis for the sparsification rule.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...

  2. SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

    cs.LG 2026-05 unverdicted novelty 4.0

    SparseRL-Sync achieves lossless weight synchronization in large-scale RL by sending only changed parameters, reducing communication volume by roughly 100x under observed 99%+ element-level sparsity.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 2 Pith papers · 19 internal anchors

  1. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  2. Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.

  3. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

  4. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

  5. Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Mach...

  6. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  7. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  8. Saksham Sahai Srivastava and Vaneet Aggarwal. A technical survey of reinforcement learning techniques for large language models. arXiv preprint arXiv:2507.04136, 2025.

  9. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  10. Team OLMo et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025.

  11. Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  12. Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.

  13. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Twentieth European Conference on Computer Systems (EuroSys ’25), 2025. doi: 10.1145/3689031.3696075.

  14. Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, and Oleksii Kuchaiev. NeMo-Aligner: Scalable toolkit for efficient model alignment. In Conference on Language Modeling (COLM), 2024.

  15. Prime Intellect Team, Sami Jaghouar, Justus Mattern, Jack Min Ong, Jannik Straube, Manveer Basra, Aaron Pazdera, Kushal Thaman, Matthew Di Ferrante, Felix Gabriel, Fares Obeid, Kemal Erdem, Michael Keiblinger, and Johannes Hagemann. INTELLECT-2: A reasoning model trained through globally decentralized reinforcement learning, 2025.

  16. Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, volume 30, 2017.

  17. Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.

  18. Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, volume 32, 2019.

  19. Amir Sarfi, Benjamin Thérien, Joel Lidin, and Eugene Belilovsky. Communication efficient LLM pre-training with SparseLoCo. arXiv preprint arXiv:2508.15706, 2025.

  20. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.

  21. Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tür, and Hao Peng. Reinforcement learning finetunes small subnetworks in large language models. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025.

  22. Qiying Yu et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

  23. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  24. Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.

  25. Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  26. Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  27. Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  28. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021.

  29. Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for LLMs. arXiv preprint arXiv:2510.13786, 2025.

  30. Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025.

  31. Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai. The path not taken: RLVR provably learns off the principals. arXiv preprint arXiv:2511.08567, 2025.

  32. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.

  33. Erfan Miahi. grail-v0: How we built a fully open, incentivized, decentralized reinforcement learning system. Templar Research Blog, 2025. URL https://templarresearch.substack.com/p/grail-v0-how-we-built-a-fully-open.

  34. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

  35. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  36. DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

  37. Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, et al. 2 OLMo 2 furious. arXiv preprint arXiv:2501.00656, 2025.

  38. Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.

A Theoretical Analysis

A.1 Formal Sparsity Definitions

We provide formal definitions of the sparsity metrics used thr...

  • Training process: Executes a tight loop that samples batches from the replay buffer and performs gradient updates. This process never waits for I/O, enabling multiple updates per window.

  • Upload process: Handles checkpoint serialization and upload asynchronously. When the trainer produces a new checkpoint, it is handed off to this process without blocking.

  • Download process: Fetches verified rollouts from storage at window boundaries and adds them to the replay buffer with staleness metadata.

Replay buffer. The replay buffer decouples data arrival from training consumption. It stores rollouts from multiple windows, supports staleness-weighted sampling (preferring fresher data), and implements automatic evicti...

  1. Sort indices in ascending order

  2. Store first index as-is (4 bytes)

  3. Store subsequent indices as differences from previous

  4. Use variable-length encoding for differences

This typically reduces index storage by 40–60% before zstd compression.

E.3 Memory Management

The PULSE method requires maintaining the previous checkpoint to compute the sparse delta. The memory overhead is minimal: • Training node: Maintains the current weights on the GPU and the previous weights in pinned CP...
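
The index-encoding steps listed above (sort, store the first index in 4 bytes, store gaps, variable-length encode the gaps) can be sketched as follows. The source does not say which variable-length code is used, so the LEB128-style varint here is an assumption, as are the little-endian layout and the function names.

```python
import struct

def encode_indices(indices):
    """Sort, write the first index as 4 raw bytes, then varint-encode the gaps."""
    ordered = sorted(indices)
    if not ordered:
        return b""
    out = bytearray(struct.pack("<I", ordered[0]))  # first index as-is
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        # LEB128-style varint (assumed): 7 payload bits per byte,
        # high bit set on every byte except the last.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_indices(data):
    """Invert encode_indices, returning the sorted index list."""
    if not data:
        return []
    result = [struct.unpack_from("<I", data, 0)[0]]
    pos, gap, shift = 4, 0, 0
    while pos < len(data):
        byte = data[pos]
        pos += 1
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this gap
        else:
            result.append(result[-1] + gap)
            gap, shift = 0, 0
    return result

idx = [1000, 3, 70000, 1005, 4]
blob = encode_indices(idx)
restored = decode_indices(blob)
```

For this toy input the encoded form takes 11 bytes, compared with 20 bytes for five raw 4-byte indices; the savings on real patches depend on how the changed indices are distributed.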