SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication

Haotian Xie; Junlin Chen; Mingkai Zheng; Zhao Zhang

arxiv: 2607.01678 · v1 · pith:IUO2D37Cnew · submitted 2026-07-02 · 💻 cs.LG · cs.DC

Pith reviewed 2026-07-03 17:55 UTC · model grok-4.3

keywords: sparse communication · LLM pre-training · distributed optimizer · gradient sparsification · Adam optimizer · communication efficiency · data-parallel training

The pith

SCAPE enables 99% sparse communication in LLM training by deriving masks from stable first-moment statistics instead of raw gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that aggressive sparsification at 90% and 99% levels remains stable for Adam-style optimizers when masks are built from first-moment statistics rather than raw gradients. This change, combined with partitioned mask generation, a one-step delay for overlap, and single-buffer reconstruction of second moments, keeps training convergence, validation loss, and downstream accuracy intact. The reported results are an end-to-end wall-clock reduction of up to 43.3% for Llama-500M and a per-step speedup of up to 3.26× over dense AdamS for Llama-1.8B on 32-GPU clusters, while matching dense baselines. If the claim holds, data-parallel and sharded LLM training can reduce communication volume dramatically without the instability previously seen at high sparsity.
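To make the contrast between raw-gradient and first-moment masks concrete, here is a minimal sketch under stated assumptions. The page does not say which first-moment statistic SCAPE ranks, so the code assumes plain magnitude top-k on an Adam-style exponential moving average. The names `topk_mask`, `ema`, `jaccard`, and `mask_stability` are hypothetical, and the gradients are synthetic (a fixed signal plus Gaussian noise), not data from the paper.

```python
import random


def topk_mask(values, density):
    """Return the index set of the largest-magnitude entries.

    `density` is the kept fraction: 0.01 corresponds to 99% sparsity.
    """
    k = max(1, int(len(values) * density))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])


def ema(m, g, beta1=0.9):
    """Adam-style first-moment update: m <- beta1 * m + (1 - beta1) * g."""
    return [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]


def jaccard(a, b):
    """Overlap between two index sets; 1.0 means identical masks."""
    return len(a & b) / len(a | b)


def mask_stability(n=2000, steps=50, density=0.01, noise=1.0, seed=0):
    """Average step-to-step overlap of gradient-based and moment-based masks."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 0.3) for _ in range(n)]  # assumed fixed underlying signal
    m = [0.0] * n
    prev_g = prev_m = None
    overlap_g = overlap_m = 0.0
    for _ in range(steps):
        g = [s + rng.gauss(0.0, noise) for s in signal]  # synthetic noisy gradient
        m = ema(m, g)
        mask_g, mask_m = topk_mask(g, density), topk_mask(m, density)
        if prev_g is not None:
            overlap_g += jaccard(mask_g, prev_g)
            overlap_m += jaccard(mask_m, prev_m)
        prev_g, prev_m = mask_g, mask_m
    return overlap_g / (steps - 1), overlap_m / (steps - 1)


if __name__ == "__main__":
    grad_overlap, moment_overlap = mask_stability()
    print(f"raw-gradient mask overlap: {grad_overlap:.3f}")
    print(f"first-moment mask overlap: {moment_overlap:.3f}")
```

On this toy stream the first-moment masks change less from step to step than the raw-gradient masks, which is the intuition behind the claim. It says nothing about real LLM gradients.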

Core claim

SCAPE derives communication masks from first-moment-based statistics, partitions mask generation across workers to align with sharding, delays mask usage by one step to overlap synchronization with computation, and reconstructs the quantities needed for second-moment updates from a single synchronized sparse buffer.
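The partitioning and one-step delay in this claim can be sketched as follows. This is an illustration, not the Megatron-LM implementation: list concatenation stands in for the all-gather, the names `local_topk`, `sharded_mask`, and `DelayedMask` are hypothetical, and the starting mask handed to `DelayedMask` is an assumption because the page does not say how the first step is handled.

```python
def local_topk(shard, density):
    """Shard-local indices of the largest-magnitude entries, in ascending order."""
    k = max(1, int(len(shard) * density))
    order = sorted(range(len(shard)), key=lambda i: abs(shard[i]), reverse=True)
    return sorted(order[:k])


def sharded_mask(first_moment, num_workers, density):
    """Each worker ranks only its own shard; concatenation mimics an all-gather."""
    shard_len = len(first_moment) // num_workers  # assumes an even split
    full_mask = []
    for rank in range(num_workers):
        start = rank * shard_len
        shard = first_moment[start:start + shard_len]
        full_mask.extend(start + i for i in local_topk(shard, density))
    return full_mask


class DelayedMask:
    """Hands out the mask built one step earlier, so the refresh can overlap compute."""

    def __init__(self, initial_mask):
        self._ready = list(initial_mask)  # assumed bootstrap mask for step 0

    def step(self, freshly_built_mask):
        mask_to_use = self._ready               # built during the previous step
        self._ready = list(freshly_built_mask)  # becomes usable on the next step
        return mask_to_use


if __name__ == "__main__":
    m = [5.0, 4.0, 0.1, 0.2, 0.3, 0.1, 0.0, 0.2]
    print(sharded_mask(m, num_workers=2, density=0.25))
```

One design consequence worth noting: a per-shard top-k gives every shard the same budget, so it can differ from a global top-k. In the example the two largest entries both sit in the first shard, yet the sharded mask picks one index from each shard.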

What carries the argument

First-moment-based mask construction with partitioned generation, a one-step delay, and single-buffer second-moment reconstruction.

If this is right

End-to-end wall-clock time for Llama-500M pre-training drops by up to 43.3% at 99% sparsity while matching dense model quality.
Validation loss curves and downstream task accuracy remain comparable to dense AdamW and AdamS at both 90% and 99% sparsity.
Per-step speedup reaches 3.26 times versus dense AdamS for Llama-1.8B under the same hardware setup.
Communication volume falls enough to support larger data-parallel degrees without proportional increases in network traffic.
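The two headline numbers above use different units: one is a reduction in wall-clock time, the other a speedup factor, and they are measured on different models. A small conversion helper makes them comparable. The function names are hypothetical; the arithmetic is only the standard relation between a time reduction r and a speedup s, namely s = 1 / (1 - r).

```python
def speedup_from_time_reduction(reduction):
    """A run taking (1 - reduction) of the baseline time is 1 / (1 - reduction) times faster."""
    return 1.0 / (1.0 - reduction)


def time_reduction_from_speedup(speedup):
    """Inverse conversion: a speedup of s leaves 1 / s of the baseline time."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Llama-500M, end-to-end: 43.3% less wall-clock time.
    print(f"{speedup_from_time_reduction(0.433):.2f}x")
    # Llama-1.8B, per step: 3.26x faster than dense AdamS.
    print(f"{time_reduction_from_speedup(3.26):.1%}")
```

Under this conversion, a 43.3% wall-clock reduction corresponds to roughly a 1.76× end-to-end speedup, and a 3.26× per-step speedup corresponds to roughly 69.3% less time per step.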

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same first-moment masking logic could apply to other momentum-based optimizers that maintain stable first moments.
Reduced communication at 99% sparsity might allow equivalent training on clusters with lower-bandwidth interconnects.
If first moments stay informative at extreme sparsity, the approach may generalize to even larger models where communication dominates runtime.
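A back-of-envelope estimate shows why these extensions are plausible. The sketch below counts only payload bytes per worker per synchronization and ignores collective algorithms, latency, and the mask refresh itself. The 4-byte value width, the optional index width, and the function names are all assumptions, not figures from the paper.

```python
def dense_bytes(num_params, bytes_per_value=4):
    """Payload of a dense gradient synchronization, per worker."""
    return num_params * bytes_per_value


def sparse_bytes(num_params, density, bytes_per_value=4, bytes_per_index=0):
    """Payload when only a `density` fraction of entries is sent.

    bytes_per_index=0 models a mask that every worker already holds,
    so values can be sent without their positions.
    """
    kept = round(num_params * density)
    return kept * (bytes_per_value + bytes_per_index)


if __name__ == "__main__":
    n = 500_000_000  # assumed parameter count for a 500M-class model
    for density in (0.1, 0.01):
        ratio = dense_bytes(n) / sparse_bytes(n, density)
        print(f"density {density}: payload shrinks by a factor of {ratio:.0f}")
```

With a shared mask the payload shrinks in direct proportion to the density. If positions had to travel with the values, the saving would be smaller, which is one reason a synchronized mask matters.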

Load-bearing premise

The first-moment statistics remain sufficiently stable to produce effective communication masks at 99% sparsity without degrading convergence for Adam-style optimizers.

What would settle it

If pre-training Llama-500M with SCAPE at 99% sparsity produced measurably higher validation loss than a dense AdamS baseline after the same number of steps, that would show the method does not preserve quality.

Figures

Figures reproduced from arXiv: 2607.01678 by Haotian Xie, Junlin Chen, Mingkai Zheng, Zhao Zhang.

**Figure 2.** Megatron-LM with sharded data parallel distributed…

**Figure 3.** Comparison between AdamW and AdamS after switch…

**Figure 4.** Gradient distribution of different layers in GPT-345M

**Figure 5.** Gradient distribution of different layers in Llama-500M

**Figure 7.** This method has two important differences from…

**Figure 7.** Communication in refreshing the top-k mask. Each worker computes a sharded top-k mask and then uses all-gather to construct the full mask. Since the usage of the top-k mask is delayed by one step, the asynchronous all-gather can be hidden by the expensive backward computation.

… sparse payload communication with local non-top-k weight-decay updates, then writes back the updated buffer and offloads it to the host memory for t…

**Figure 8.** Pre-training loss curves for Llama-500M

A. Experiment Setup. We evaluate SCAPE by pre-training GPT-345M and Llama-500M on 32 NVIDIA GH200 GPUs of the Vista supercomputer [26] at the Texas Advanced Computing Center (TACC). Each Vista node consists of a Grace-Hopper architecture with one GH200 GPU, 96 GB of HBM3 memory, and an NVLink-C2C interconnect between the Grace CPU and Hopper GPU. The nodes are connec…

**Figure 9.** Pre-training loss curves for GPT-345M

TABLE III: Final training and validation loss of pre-training GPT-345M

| Method | Train loss | Val. loss |
|---|---|---|
| AdamW (dense all-reduce) | 2.80 | 2.76 |
| AdamS (dense all-reduce) | 2.77 | 2.73 |
| SCAPE (d = 0.1) | 2.77 | 2.73 |
| SCAPE (d = 0.01) | 2.81 | 2.76 |

Surprisingly, given the same token budget, when using d = 0.1 (90% sparsity), SCAPE achieves lower training and validation loss than the dense AdamS.…

**Figure 10.** Per-step time comparison between different meth…

**Figure 11.** Strong scaling efficiency for training Llama-500M

**Figure 12.** Memory usage for training Llama-500M (sequence…

Original abstract

Communication increasingly dominates the cost of Large Language Model (LLM) pre-training, especially under data-parallel and sharded training schemes, where gradient synchronization and parameter reconstruction overhead increase with model size and system scale. Existing communication-reduction methods either sparsify raw gradients, which can be unstable for modern Adam-style optimizers at high sparsity, or quantize communication, whose savings are fundamentally bounded by bit width and often incur additional runtime overhead. We present SCAPE, a communication-efficient distributed optimizer for LLM training that exploits the stability of AdamS's first moment to enable aggressive sparsification without loss of LLM quality. Instead of constructing masks from raw gradients, SCAPE derives them from first-moment-based statistics, partitions mask generation across workers to align with optimizer sharding, and delays mask usage by one step so that mask synchronization can overlap with computation. SCAPE also reconstructs the quantities required for second-moment updates from a single synchronized sparse buffer, avoiding an additional collective. We implement SCAPE in Megatron-LM and evaluate its convergence by pre-training GPT-345M on OpenWebText and Llama-500M on SlimPajama-6B using 32 NVIDIA GH200 GPUs on TACC Vista. In both models, SCAPE preserves training stability, validation loss, and downstream task accuracy under 90% and 99% sparsity. For Llama-500M, SCAPE reduces end-to-end pre-training wall-clock time by up to 43.3% while maintaining model quality comparable to dense AdamW and AdamS. For Llama-1.8B, SCAPE achieves up to 3.26× speedup per step compared to dense AdamS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SCAPE shows that first-moment masks, combined with a one-step delay and single-buffer reconstruction, can reach 99% sparsity in Adam-style training on models of up to 500M parameters with matching loss and a 43% wall-clock reduction, but the stability claim rests on limited evidence.

SCAPE's main move is deriving communication masks from the first moment of AdamS rather than raw gradients, then partitioning mask creation across workers, delaying use by one step, and rebuilding the second moment from a single sparse buffer. That combination is not in the prior sparsification work they cite.

On GPT-345M and Llama-500M they report that 90% and 99% sparsity keeps validation loss and downstream accuracy in line with dense AdamW and AdamS. The Llama-500M run also shows up to 43% lower wall-clock time on 32 GH200 GPUs, and they give a 3.26x per-step speedup number for a 1.8B model. The Megatron-LM implementation is a practical plus.

The soft spot is the thin support for why the first-moment proxy stays reliable once 99% of its entries are zero. The paper gives no ablations on mask construction details, no error bars, and no analytic check on whether the sparse first moment still points to the right coordinates. The stress-test concern about mask error compounding into the second-moment state is reasonable given what is shown.

This is for engineers working on communication-bound LLM training. The concrete speedups and held-out task results are enough to send it to referees, though they will likely ask for the missing ablations and statistical checks.

Referee Report

3 major / 2 minor

Summary. The paper introduces SCAPE, a communication-efficient distributed optimizer for LLM pre-training. It derives communication masks from AdamS first-moment statistics rather than raw gradients, partitions mask generation across workers, delays mask application by one step to overlap with computation, and reconstructs second-moment quantities from a single sparse buffer. Experiments pre-train GPT-345M on OpenWebText and Llama-500M on SlimPajama-6B at 90% and 99% sparsity using 32 GH200 GPUs, reporting preserved validation loss, training stability, and downstream accuracy comparable to dense AdamW and AdamS, with up to 43.3% wall-clock reduction for Llama-500M and 3.26× per-step speedup for Llama-1.8B.

Significance. If the quality preservation holds, SCAPE would meaningfully reduce communication bottlenecks in data-parallel and sharded LLM training at extreme sparsity levels. The concrete speedups on production-scale hardware (Megatron-LM implementation) and evaluation on downstream tasks for two model sizes constitute practical evidence. The engineering choices—sharded mask generation and single-buffer reconstruction—are clear strengths that avoid additional collectives.
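The point about avoiding an extra collective can be illustrated with a toy update. The sketch assumes a standard Adam-style second moment, v <- beta2 * v + (1 - beta2) * g^2; AdamS's own normalizer and SCAPE's actual reconstruction are not described on this page, so this is only one way such a scheme could look. `allreduce_mean` is a plain-Python stand-in for a collective, all names are hypothetical, and leaving unmasked coordinates untouched is an assumption.

```python
def allreduce_mean(buffers):
    """Stand-in for one collective: element-wise mean across workers."""
    workers = len(buffers)
    return [sum(column) / workers for column in zip(*buffers)]


def sparse_adam_moments(m, v, local_grads, mask, beta1=0.9, beta2=0.999):
    """Update both moments on masked coordinates from one synchronized buffer.

    local_grads holds one gradient list per worker; mask is a shared index list.
    """
    # Each worker packs only its masked gradient entries into a small buffer.
    packed = [[g[i] for i in mask] for g in local_grads]
    synced = allreduce_mean(packed)  # the only collective in this step

    m, v = list(m), list(v)
    for value, i in zip(synced, mask):
        m[i] = beta1 * m[i] + (1.0 - beta1) * value
        # The squared term is derived locally from the same buffer,
        # so no second collective carrying squared gradients is needed.
        v[i] = beta2 * v[i] + (1.0 - beta2) * value * value
    return m, v


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # two workers, three parameters
    print(sparse_adam_moments([0.0] * 3, [0.0] * 3, grads, mask=[0, 2]))
```

Note that the square is taken of the averaged gradient, not averaged over per-worker squares. That matches the usual data-parallel Adam update, where moments are computed from the synchronized gradient.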

major comments (3)

[§3.2] (first-moment mask construction): The central claim that first-moment statistics remain a faithful proxy for gradient importance at 99% sparsity lacks any analytic bound, sensitivity analysis, or ablation showing that the surviving non-zero entries continue to identify critical update directions once 99% of the vector is zeroed; this assumption directly underpins the reported preservation of validation loss and downstream accuracy.
[§4.2–4.3] (Llama-500M 99% sparsity runs): The equivalence in validation loss and downstream accuracy is reported without error bars, multiple random seeds, or statistical tests, leaving open whether observed differences fall within run-to-run variance; this weakens verification of the no-degradation claim at the highest sparsity level.
[§3.3] (single-buffer second-moment reconstruction): The propagation of any mask error from the delayed first moment into the reconstructed second-moment state is not quantified, yet this step is load-bearing for optimizer state fidelity at 99% sparsity.
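One way the ablation requested for §3.2 could be instrumented is to compare the first-moment mask against an oracle mask built from the gradient itself. The sketch below is a hypothetical diagnostic, not something the paper reports: `proxy_report` and its metric names are invented here, and plain magnitude top-k is assumed for both masks.

```python
def topk_indices(values, k):
    """Index set of the k largest-magnitude entries."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])


def captured_energy(grad, mask):
    """Fraction of the squared gradient norm that lies on the masked coordinates."""
    total = sum(g * g for g in grad)
    return sum(grad[i] ** 2 for i in mask) / total if total else 0.0


def proxy_report(grad, first_moment, density):
    """Compare a first-moment mask with the oracle mask taken from the gradient."""
    k = max(1, int(len(grad) * density))
    proxy = topk_indices(first_moment, k)
    oracle = topk_indices(grad, k)
    return {
        "index_overlap": len(proxy & oracle) / k,
        "proxy_energy": captured_energy(grad, proxy),
        "oracle_energy": captured_energy(grad, oracle),
    }


if __name__ == "__main__":
    print(proxy_report([3.0, 0.0, 4.0, 0.0], [1.0, 2.0, 0.1, 0.0], density=0.5))
```

Logging these three numbers per layer over training would show whether the proxy keeps most of the gradient energy at 99% sparsity, and where it does not.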

minor comments (2)

[Abstract] The abstract states results for Llama-1.8B but the experimental section focuses on 345M/500M models; clarify the scale at which the 3.26× per-step figure was measured.
[Figures/Tables] Figure captions and tables would benefit from explicit mention of the number of runs and any variance measures.
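For the variance measures asked for above, a minimal sketch using only the Python standard library is given below. It summarizes final losses over seeds and computes Welch's t statistic between two methods. The helper names are hypothetical, the loss lists must come from real runs, and with only a handful of seeds the statistic should be read as a rough indication, not a verdict.

```python
import math
import statistics


def summarize(losses):
    """Mean and sample standard deviation over seeds (needs at least two runs)."""
    return statistics.mean(losses), statistics.stdev(losses)


def welch_t(a, b):
    """Welch's t statistic for two groups with possibly unequal variances.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / standard_error
```

A caption could then report each method as mean ± standard deviation over the number of seeds, with the t statistic given alongside the comparison it supports.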

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [§3.2] (first-moment mask construction): The central claim that first-moment statistics remain a faithful proxy for gradient importance at 99% sparsity lacks any analytic bound, sensitivity analysis, or ablation showing that the surviving non-zero entries continue to identify critical update directions once 99% of the vector is zeroed; this assumption directly underpins the reported preservation of validation loss and downstream accuracy.

Authors: We acknowledge the absence of an analytic bound. The manuscript relies on empirical validation across GPT-345M and Llama-500M at 90% and 99% sparsity, showing preserved validation loss and downstream accuracy. In revision we will add sensitivity analysis and targeted ablations on mask construction at 99% sparsity to better characterize the surviving entries.
revision: partial

Referee: [§4.2–4.3] (Llama-500M 99% sparsity runs): The equivalence in validation loss and downstream accuracy is reported without error bars, multiple random seeds, or statistical tests, leaving open whether observed differences fall within run-to-run variance; this weakens verification of the no-degradation claim at the highest sparsity level.

Authors: We agree that multiple seeds and error bars would improve statistical rigor. The reported results used single runs due to resource limits on 32 GH200 GPUs. We will rerun the Llama-500M 99% sparsity experiments with at least three seeds, add error bars, and include basic statistical comparison in the revised manuscript.
revision: yes

Referee: [§3.3] (single-buffer second-moment reconstruction): The propagation of any mask error from the delayed first moment into the reconstructed second-moment state is not quantified, yet this step is load-bearing for optimizer state fidelity at 99% sparsity.

Authors: We will add an analysis quantifying the effect of delayed mask errors on the reconstructed second-moment quantities, including a simple error-propagation bound or empirical measurement at 99% sparsity, to be included in §3.3 or an appendix.
revision: yes
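The empirical measurement promised in the last response could take a shape like the sketch below. It tracks a dense Adam-style second moment next to one that is updated only on masked coordinates and records their relative L2 gap per step. Everything here is an assumption made for illustration: the standard Adam second-moment rule, leaving unmasked coordinates unchanged, and the names `relative_l2_error` and `v_error_trace`. The caller supplies the masks, already delayed if desired, so the harness does not presume how SCAPE builds them.

```python
import math


def relative_l2_error(approx, exact):
    """||approx - exact|| / ||exact||, or 0.0 when exact is all zeros."""
    num = math.sqrt(sum((a - e) ** 2 for a, e in zip(approx, exact)))
    den = math.sqrt(sum(e * e for e in exact))
    return num / den if den else 0.0


def v_error_trace(grad_stream, mask_stream, beta2=0.999):
    """Per-step gap between a dense second moment and a mask-restricted one."""
    n = len(grad_stream[0])
    v_dense, v_sparse = [0.0] * n, [0.0] * n
    trace = []
    for grad, mask in zip(grad_stream, mask_stream):
        keep = set(mask)
        v_dense = [beta2 * v + (1.0 - beta2) * g * g for v, g in zip(v_dense, grad)]
        # Assumption: coordinates outside the mask keep their previous value.
        v_sparse = [
            beta2 * v + (1.0 - beta2) * g * g if i in keep else v
            for i, (v, g) in enumerate(zip(v_sparse, grad))
        ]
        trace.append(relative_l2_error(v_sparse, v_dense))
    return trace


if __name__ == "__main__":
    grads = [[1.0, 1.0], [1.0, 1.0]]
    masks = [[0, 1], [0]]  # second step drops coordinate 1
    print(v_error_trace(grads, masks))
```

Plotting such a trace at 90% and 99% sparsity would show whether the gap stays bounded or accumulates over training.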

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents an algorithmic construction (masks from first-moment statistics, one-step delay, single-buffer reconstruction) and validates it through direct empirical measurement of wall-clock time, validation loss, and downstream accuracy on held-out pre-training runs. No equations, predictions, or uniqueness claims reduce the reported outcomes to quantities fitted inside the paper or to self-citations; the central results are externally falsifiable experimental measurements rather than tautological re-expressions of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that first-moment statistics are stable enough for mask decisions at extreme sparsity; no new entities are postulated and no free parameters are explicitly fitted to target the reported quality metrics.

axioms (1)

Domain assumption: First-moment statistics from AdamS remain stable enough to generate effective communication masks at 99% sparsity without convergence degradation.
The method replaces raw-gradient mask construction with first-moment-based statistics and claims no loss of LLM quality; this stability is invoked to justify aggressive sparsification.

Reference graph

Works this paper leans on

46 extracted references · 36 canonical work pages · 16 internal anchors

[1] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large language models for mathematical reasoning: Progresses and challenges,” 2024. [Online]. Available: https://arxiv.org/abs/2402.00157
[2] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” 2021. [Online]. Available: https://arxiv.org/abs/2107.03374
[3] N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng, and G. Ceder, “An autonomous laboratory for the accelerated synthesis of inorganic materials,” Nature, vol. 624, no. 7990, pp. 86–91, 2023. [Online]. Available: https://doi.org/10.10... (doi:10.1038/s41586-023-06734-w)
[4] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,”
[5] Decoupled Weight Decay Regularization. [Online]. Available: https://arxiv.org/abs/1711.05101
[6] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,” 2020. [Online]. Available: https://arxiv.org/abs/1910.02054
[7] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li, “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel,”
[8] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. [Online]. Available: https://arxiv.org/abs/2304.11277
[9] NVIDIA, “Megatron-lm,” 2026. [Online]. Available: https://github.com/NVIDIA/Megatron-LM
[10] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,” 2020. [Online]. Available: https://arxiv.org/abs/1712.01887
[11] B. Peng, L. Chen, B. Su, J. Quesnelle, D. P. Kingma, and Q. Liu, “DeMo: Decoupled Momentum Optimization,” 2026. [Online]. Available: https://arxiv.org/abs/2411.19870
[13] S. Li and T. Hoefler, “Near-optimal sparse allreduce for distributed deep learning,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, mar
[14] [Online]. Available: https://doi.org/10.1145/3503221.3508399
[15] M. Zheng and Z. Zhang, “Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training,” in Proceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin, Eds., vol. 7. MLSys, 2025. [Online]. Available: https://proceedings.mlsys.org/paper_files/paper/2025/file/54dd9e0cff6d9214e20d97eb2a3bae49-Paper-Conference.pdf
[16] I. Markov, A. Vladu, Q. Guo, and D. Alistarh, “Quantized Distributed Training of Large Models with Convergence Guarantees,” 2023. [Online]. Available: https://arxiv.org/abs/2302.02390
[17] G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, “ZeRO++: Extremely Efficient Collective Communication for Giant Model Training,” 2023. [Online]. Available: https://arxiv.org/abs/2306.10209
[18] J. Jia, C. Xie, H. Lu, D. Wang, H. Feng, C. Zhang, B. Sun, H. Lin, Z. Zhang, X. Liu, and D. Tao, “SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training,” 2024. [Online]. Available: https://arxiv.org/abs/2410.15526
[19] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with Memory,” 2018. [Online]. Available: https://arxiv.org/abs/1809.07599
[20] H. Zhang, B. Wang, and L. Chen, “AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 10 719–10 7... (doi:10.18653/v1/2025.emnlp-main.543)
[21] A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex, “OpenWebText Corpus,” http://Skylion007.github.io/OpenWebTextCorpus, 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3834942
[22] D. Soboleva, F. Al-Khateeb, R. Myers, J. R. Steeves, J. Hestness, and N. Dey, “SlimPajama: A 627B token cleaned and deduplicated version of RedPajama,” 2023. [Online]. Available: https://huggingface.co/datasets/cerebras/SlimPajama-627B
[23] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”
[24] Adam: A Method for Stochastic Optimization. [Online]. Available: https://arxiv.org/abs/1412.6980
[25] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019
[26] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300
[27] I. Molybog, P. Albert, M. Chen, Z. DeVito, D. Esiobu, N. Goyal, P. S. Koura, S. Narang, A. Poulton, R. Silva, B. Tang, D. Liskovich, P. Xu, Y. Zhang, M. Kambadur, S. Roller, and S. Zhang, “A Theory on Adam Instability in Large-Scale Machine Learning,” 2023. [Online]. Available: https://arxiv.org/abs/2304.09871
[28] Z. Bai, Z. Zhou, J. Zhao, X. Li, Z. Li, F. Xiong, H. Yang, Y. Zhang, and Z.-Q. J. Xu, “Adaptive preconditioners trigger loss spikes in adam,”
[29] Adaptive Preconditioners Trigger Loss Spikes in Adam. [Online]. Available: https://arxiv.org/abs/2506.04805
[30] H. Robbins and S. Monro, “A Stochastic Approximation Method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951. [Online]. Available: https://doi.org/10.1214/aoms/1177729586
[31] A. Ruhela, J. Cazes, J. D. McCalpin, C. Del-Castillo-Negrete, J. Li, H. Liu, H. Chen, C.-Y. Lu, K. F. Milfeld, W. Zhang, I. Wang, L. Koesterke, J. DeSantis, N. Lewis, S. Hempel, and D. Stanzione, “Performance Analysis of Scientific Applications on an NVIDIA Grace System,” in SC24-W: Workshops of the International Conference for High Performance Computing,... (doi:10.1109/scw63240.2024.00078)
[32] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou, “The Language Model Evaluation Harness,” 07 2024. [Online]. Available: https://doi.org/10.52... (doi:10.5281/zenodo.12608602)
[33] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge,” 2018. [Online]. Available: https://arxiv.org/abs/1803.05457
[34] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández, “The LAMBADA dataset: Word prediction requiring a broad discourse context,” 2016. [Online]. Available: https://arxiv.org/abs/1606.06031
[35] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “HellaSwag: Can a Machine Really Finish Your Sentence?” 2019. [Online]. Available: https://arxiv.org/abs/1905.07830
[36] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,”
[37] Measuring Massive Multitask Language Understanding. [Online]. Available: https://arxiv.org/abs/2009.03300
[38] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “PIQA: Reasoning about Physical Commonsense in Natural Language,” 2019. [Online]. Available: https://arxiv.org/abs/1911.11641
[39] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi, “WinoGrande: An Adversarial Winograd Schema Challenge at Scale,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8732–8740. [Online]. Available: https://doi.org/10.1609/aaai.v34i05.6399
[40] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 2381–2391. [Online]. Available: https://doi.org/10.18653/v1/D18-1260
[41] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” 2020. [Online]. Available: https://arxiv.org/abs/1905.00537
[42] P. Pfeiffer, P. Singer, Y. Babakhin, G. Fodor, N. Dhankhar, and S. S. Ambati, “H2O-Danube3 Technical Report,” 2024. [Online]. Available: https://arxiv.org/abs/2407.09276
[43] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Koren... (title: Llama 2: Open Foundation and Fine-Tuned Chat Models; arXiv, 2023)
[44] Q. Yi, J. Duan, H. Hu, Q. Hua, H. Zhao, S. Qian, D. Yang, J. Cao, J. Tang, Y. Yu, C. Liao, K. Wang, and L. Zhang, “EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training,” 2025. [Online]. Available: https://arxiv.org/abs/2511.10333
[45] H. Wang, S. Sievert, Z. Charles, S. Liu, S. Wright, and D. Papailiopoulos, “ATOMO: Communication-efficient Learning via Atomic Sparsification,” 2018. [Online]. Available: https://arxiv.org/abs/1806.04090
[46] T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization,” 2020. [Online]. Available: https://arxiv.org/abs/1905.13727
[47] J. Song, J. Yim, J. Jung, H. Jang, H.-J. Kim, Y. Kim, and J. Lee, “Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression,” 2023. [Online]. Available: https://arxiv.org/abs/2301.09830

[1] [1]

arXiv preprint arXiv:2402.00157 , year=

J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large language models for mathematical reasoning: Progresses and challenges,” 2024. [Online]. Available: https://arxiv.org/abs/2402.00157

work page arXiv 2024

[2] [2]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating large language models trained on code,” 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

An autonomous laboratory for the accelerated synthesis of inorganic materials,

N. J. Szymanski, B. Rendy, Y . Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y . Zeng, and G. Ceder, “An autonomous laboratory for the accelerated synthesis of inorganic materials,”Nature, vol. 624, no. 7990, pp. 86–91, 2023. [Online]. Available: https://doi.org/10.10...

work page doi:10.1038/s41586-023-06734-w 2023

[4] [4]

Decoupled Weight Decay Regularization,

I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,”

[5] [5]

Decoupled Weight Decay Regularization

[Online]. Available: https://arxiv.org/abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

S. Rajbhandari, J. Rasley, O. Ruwase, and Y . He, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,” 2020. [Online]. Available: https://arxiv.org/abs/1910.02054

work page internal anchor Pith review Pith/arXiv arXiv 2020

[7] [7]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li, “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel,” [Online]. Available: https://arxiv.org/abs/2304.11277

[9] [9]

Megatron-lm

NVIDIA, “Megatron-lm,” 2026. [Online]. Available: https://github.com/NVIDIA/Megatron-LM

[10] [10]

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,” 2020. [Online]. Available: https://arxiv.org/abs/1712.01887

[11] [11]

DeMo: Decoupled Momentum Optimization

B. Peng, L. Chen, B. Su, J. Quesnelle, D. P. Kingma, and Q. Liu, “DeMo: Decoupled Momentum Optimization,” 2026. [Online]. Available: https://arxiv.org/abs/2411.19870

[12] [13]

Near-optimal sparse allreduce for distributed deep learning

S. Li and T. Hoefler, “Near-optimal sparse allreduce for distributed deep learning,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, Mar. [Online]. Available: https://doi.org/10.1145/3503221.3508399

[14] [15]

Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training

M. Zheng and Z. Zhang, “Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training,” in Proceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin, Eds., vol. 7. MLSys, 2025. [Online]. Available: https://proceedings.mlsys.org/paper_files/paper/2025/file/54dd9e0cff6d9214e20d97eb2a3bae49-Paper-Conference.pdf

[15] [16]

Quantized Distributed Training of Large Models with Convergence Guarantees

I. Markov, A. Vladu, Q. Guo, and D. Alistarh, “Quantized Distributed Training of Large Models with Convergence Guarantees,” 2023. [Online]. Available: https://arxiv.org/abs/2302.02390

[16] [17]

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, “ZeRO++: Extremely Efficient Collective Communication for Giant Model Training,” 2023. [Online]. Available: https://arxiv.org/abs/2306.10209

[17] [18]

SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

J. Jia, C. Xie, H. Lu, D. Wang, H. Feng, C. Zhang, B. Sun, H. Lin, Z. Zhang, X. Liu, and D. Tao, “SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training,” 2024. [Online]. Available: https://arxiv.org/abs/2410.15526

[18] [19]

Sparsified SGD with Memory

S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with Memory,” 2018. [Online]. Available: https://arxiv.org/abs/1809.07599

[19] [20]

AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

H. Zhang, B. Wang, and L. Chen, “AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 10 719–10 7...

doi:10.18653/v1/2025.emnlp-main.543

[20] [21]

OpenWebText Corpus

A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex, “OpenWebText Corpus,” http://Skylion007.github.io/OpenWebTextCorpus, 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3834942

[21] [22]

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

D. Soboleva, F. Al-Khateeb, R. Myers, J. R. Steeves, J. Hestness, and N. Dey, “SlimPajama: A 627B token cleaned and deduplicated version of RedPajama,” 2023. [Online]. Available: https://huggingface.co/datasets/cerebras/SlimPajama-627B

[22] [23]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” [Online]. Available: https://arxiv.org/abs/1412.6980

[24] [25]

Language Models are Unsupervised Multitask Learners

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019.

[25] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300

[26] [27]

A Theory on Adam Instability in Large-Scale Machine Learning

I. Molybog, P. Albert, M. Chen, Z. DeVito, D. Esiobu, N. Goyal, P. S. Koura, S. Narang, A. Poulton, R. Silva, B. Tang, D. Liskovich, P. Xu, Y. Zhang, M. Kambadur, S. Roller, and S. Zhang, “A Theory on Adam Instability in Large-Scale Machine Learning,” 2023. [Online]. Available: https://arxiv.org/abs/2304.09871

[27] [28]

Adaptive Preconditioners Trigger Loss Spikes in Adam

Z. Bai, Z. Zhou, J. Zhao, X. Li, Z. Li, F. Xiong, H. Yang, Y. Zhang, and Z.-Q. J. Xu, “Adaptive preconditioners trigger loss spikes in adam,” [Online]. Available: https://arxiv.org/abs/2506.04805

[29] [30]

A Stochastic Approximation Method

H. Robbins and S. Monro, “A Stochastic Approximation Method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951. [Online]. Available: https://doi.org/10.1214/aoms/1177729586

[30] [31]

Performance Analysis of Scientific Applications on an NVIDIA Grace System

A. Ruhela, J. Cazes, J. D. McCalpin, C. Del-Castillo-Negrete, J. Li, H. Liu, H. Chen, C.-Y. Lu, K. F. Milfeld, W. Zhang, I. Wang, L. Koesterke, J. DeSantis, N. Lewis, S. Hempel, and D. Stanzione, “Performance Analysis of Scientific Applications on an NVIDIA Grace System,” in SC24-W: Workshops of the International Conference for High Performance Computing,...

doi:10.1109/scw63240.2024.00078

[31] [32]

The Language Model Evaluation Harness

L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou, “The Language Model Evaluation Harness,” Jul. 2024. [Online]. Available: https://doi.org/10.52...

doi:10.5281/zenodo.12608602

[32] [33]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge,” 2018. [Online]. Available: https://arxiv.org/abs/1803.05457

[33] [34]

The LAMBADA dataset: Word prediction requiring a broad discourse context

D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández, “The LAMBADA dataset: Word prediction requiring a broad discourse context,” 2016. [Online]. Available: https://arxiv.org/abs/1606.06031

[34] [35]

HellaSwag: Can a Machine Really Finish Your Sentence?

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “HellaSwag: Can a Machine Really Finish Your Sentence?” 2019. [Online]. Available: https://arxiv.org/abs/1905.07830

[35] [36]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,” [Online]. Available: https://arxiv.org/abs/2009.03300

[37] [38]

PIQA: Reasoning about Physical Commonsense in Natural Language

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “PIQA: Reasoning about Physical Commonsense in Natural Language,” 2019. [Online]. Available: https://arxiv.org/abs/1911.11641

[38] [39]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi, “WinoGrande: An Adversarial Winograd Schema Challenge at Scale,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8732–8740. [Online]. Available: https://doi.org/10.1609/aaai.v34i05.6399

[39] [40]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 2381–2391. [Online]. Available: https://doi.org/10.18653/v1/D18-1260

[40] [41]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” 2020. [Online]. Available: https://arxiv.org/abs/1905.00537

[41] [42]

H2O-Danube3 Technical Report

P. Pfeiffer, P. Singer, Y. Babakhin, G. Fodor, N. Dhankhar, and S. S. Ambati, “H2O-Danube3 Technical Report,” 2024. [Online]. Available: https://arxiv.org/abs/2407.09276

[42] [43]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Koren...

[43] [44]

EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training

Q. Yi, J. Duan, H. Hu, Q. Hua, H. Zhao, S. Qian, D. Yang, J. Cao, J. Tang, Y. Yu, C. Liao, K. Wang, and L. Zhang, “EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training,” 2025. [Online]. Available: https://arxiv.org/abs/2511.10333

[44] [45]

ATOMO: Communication-efficient Learning via Atomic Sparsification

H. Wang, S. Sievert, Z. Charles, S. Liu, S. Wright, and D. Papailiopoulos, “ATOMO: Communication-efficient Learning via Atomic Sparsification,” 2018. [Online]. Available: https://arxiv.org/abs/1806.04090

[45] [46]

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization,” 2020. [Online]. Available: https://arxiv.org/abs/1905.13727

[46] [47]

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

J. Song, J. Yim, J. Jung, H. Jang, H.-J. Kim, Y. Kim, and J. Lee, “Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression,” 2023. [Online]. Available: https://arxiv.org/abs/2301.09830