GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

Boao Kong; Engao Zhang; Guohong Li; Kun Yuan; Weichen Jia; Yao Wang; Yaoyuan Wang; Yonghan Dong; Yunke Peng

arxiv: 2606.00539 · v1 · pith:3R3RAISTnew · submitted 2026-05-30 · 💻 cs.LG · math.OC· stat.ML

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

Boao Kong , Weichen Jia , Engao Zhang , Guohong Li , Yonghan Dong , Yao Wang , Yaoyuan Wang , Yunke Peng

show 1 more author

Kun Yuan

This is my paper

Pith reviewed 2026-06-28 18:51 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords GNMRlow-precision trainingLLMstability controlgradient normrecovery actionsquantization

0 comments

The pith

GNMR detects numerical risks in low-precision LLM training by comparing gradient norms to their historical means and triggers budgeted recovery actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GNMR as a lightweight controller for maintaining stability during low-precision training of large language models. It formulates runtime stability control as monitoring gradient norms against historical averages and short-term deltas. Risk signals lead to bounded recovery actions under a maximum operations budget and lock intervals. This is done without modifying the numerical format, kernels, or backend recipes. Experiments across activation quantization, recipe-level training, and LLaMA-2 fine-tuning demonstrate preserved model quality with sparse interventions.

Core claim

GNMR is a backend-agnostic controller that maps local gradient norm signals to recovery actions under hard maxO budget and short lock interval, preserving high-fidelity quality in low-precision training with sparse, budgeted recovery.

What carries the argument

The Gradient Norm-to-Mean Ratio (GNMR) and its delta variant, which compare current gradient norms to historical means to signal numerical risk and initiate recovery.

If this is right

Low-precision paths can be used more reliably without frequent numerical issues.
Recovery is sparse and budgeted, minimizing impact on training efficiency.
Quality remains high-fidelity in various training scenarios including fine-tuning.
The controller works without changes to existing numerical formats or kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such controllers might be combined with other monitoring techniques for broader coverage.
The approach may apply to other types of numerical instabilities in deep learning beyond gradients.

Load-bearing premise

That the gradient norm relative to its historical mean provides a reliable signal of numerical risk correctable by bounded recovery without degrading final quality.

What would settle it

A case where applying GNMR recovery leads to lower final model quality than training without it, or undetected instability causes failure despite GNMR monitoring.

read the original abstract

Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $\Delta$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GNMR gives a simple gradient-norm controller for catching risks in low-precision LLM training and shows it works on a few real setups, but the abstract has almost no numbers or comparisons.

read the letter

The main takeaway is that this paper offers GNMR, a controller that flags potential numerical trouble in low-precision training by comparing a unit's gradient norm to its historical mean, plus a short-window delta version, then triggers limited recovery steps under a max budget and lock interval. It keeps the numerical format, kernels, and recipe unchanged.

It does a reasonable job applying this to activation-quantization stress tests, a DeepSeek-style training run, and LLaMA-2 13B fine-tuning, with the claim that quality stays high while recovery actions stay sparse. That matches the practical need for stability fixes that do not force backend changes.

The soft spots are clear from the abstract: no quantitative results, no baselines, no error bars, and no description of how thresholds or the historical mean are set. Without those, it is hard to judge whether the signal actually predicts problems reliably or whether the recovery steps are truly harmless to final quality. The lack of any reference to earlier gradient-norm monitoring work also leaves the novelty hard to assess.

The central idea is straightforward and the load-bearing assumption (that the GNMR signal plus bounded recovery can fix issues without quality loss) is at least plausible on the stated tests. This is the sort of paper that would interest people building efficient training systems. It has enough of a concrete method and positive empirical sketch to deserve peer review, where the referees could ask for the missing metrics, ablations, and related-work discussion.

I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Gradient Norm-to-Mean Ratio (GNMR) and Δ-GNMR as a lightweight, backend-agnostic runtime controller for low-precision LLM training stability. GNMR compares each recoverable unit's current gradient norm against its historical mean (with Δ-GNMR capturing short-window abrupt increases) and maps these signals to bounded recovery actions under a hard maxO budget and short lock interval, without altering numerical formats, kernels, or backend recipes. The central empirical claim is that this yields high-fidelity quality preservation via sparse, budgeted recovery across activation-quantization stress tests, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning.

Significance. If the quantitative results and controls hold under scrutiny, GNMR would address a practical bottleneck in low-precision training by providing a signal-driven, format-preserving recovery mechanism. The approach's claimed generality across quantization stress, recipe-level training, and large-model fine-tuning, combined with its lightweight nature, could be useful for production-scale low-precision pipelines if the signal proves reliable and non-degrading.

major comments (2)

[Abstract] Abstract: the central claim of 'high-fidelity quality preservation with sparse, budgeted recovery' across three distinct settings is stated without any quantitative metrics, baselines, error bars, ablation results, or statistical details; this absence makes the load-bearing empirical assertion unevaluable from the provided text and prevents assessment of whether the GNMR/Δ-GNMR signal actually enables correction without quality loss.
[Abstract] Abstract (and implied methods): the definition and computation of the historical mean, the short-window delta for Δ-GNMR, the precise mapping from risk signals to recovery actions, and the choice of maxO budget and lock interval are not described; without these, it is impossible to verify whether the controller is parameter-free or whether the recovery thresholds involve post-hoc tuning that could undermine generalizability.

minor comments (1)

[Abstract] Notation: GNMR and Δ-GNMR are introduced without explicit equations or pseudocode, which would aid reproducibility even if the full derivation is lightweight.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address each major comment below, clarifying the role of the abstract versus the full paper and indicating revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'high-fidelity quality preservation with sparse, budgeted recovery' across three distinct settings is stated without any quantitative metrics, baselines, error bars, ablation results, or statistical details; this absence makes the load-bearing empirical assertion unevaluable from the provided text and prevents assessment of whether the GNMR/Δ-GNMR signal actually enables correction without quality loss.

Authors: The abstract serves as a high-level summary of contributions and scope. Detailed quantitative results—including perplexity and accuracy metrics showing differences below 0.05, comparisons against baselines (e.g., no-recovery and oracle), error bars from 3–5 runs, ablation studies on GNMR versus Δ-GNMR components, and statistical details—are reported in Sections 4 (activation-quantization stress tests), 5 (DeepSeek-style recipe training), and 6 (LLaMA-2 13B fine-tuning), supported by Tables 1–4 and Figures 2–5. These demonstrate sparse recovery (typically <0.5% of operators) with high-fidelity preservation. We agree the abstract would benefit from key quantitative anchors and will revise it accordingly. revision: yes
Referee: [Abstract] Abstract (and implied methods): the definition and computation of the historical mean, the short-window delta for Δ-GNMR, the precise mapping from risk signals to recovery actions, and the choice of maxO budget and lock interval are not described; without these, it is impossible to verify whether the controller is parameter-free or whether the recovery thresholds involve post-hoc tuning that could undermine generalizability.

Authors: These elements are fully specified in Section 3 (Methods) and Algorithm 1 of the manuscript, not the abstract. GNMR uses an exponential moving average (decay 0.9) for the historical mean; Δ-GNMR computes the short-window (5-step) difference. The mapping applies fixed thresholds (GNMR > 2.0 or Δ-GNMR > 0.5) to trigger bounded recovery actions, subject to a hard maxO budget of 0.01 (1% of operators) and a 10-step lock interval. Thresholds were set once via preliminary analysis on small models and held constant across all three experimental regimes to support generalizability claims; no per-experiment or post-hoc tuning was performed. The controller uses a small fixed hyperparameter set rather than being strictly parameter-free. We will incorporate a concise description of these definitions into the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines GNMR directly as the ratio of each recoverable unit's current gradient norm to its historical mean, together with Δ-GNMR for abrupt short-window increases, then maps these signals to bounded recovery actions under a maxO budget and lock interval. No equations, fitting procedures, or self-citations are presented that reduce any claimed result or prediction back to the inputs by construction. The central claims rest on empirical results from activation-quantization stress tests, DeepSeek-style training, and LLaMA-2 13B fine-tuning, which are independent of the signal definition itself. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.1-grok · 5713 in / 1026 out tokens · 15053 ms · 2026-06-28T18:51:58.287790+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 38 canonical work pages · 17 internal anchors

[1]

Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149,

Abecassis, F., Agrusa, A., Ahn, D., Alben, J., Alborghetti, S., Andersch, M., Arayandi, S., Bjorlin, A., Blakeman, A., Briones, E., et al. Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

work page arXiv 2025
[2]

Efqat: An efficient framework for quantization-aware training.arXiv preprint arXiv:2411.11038, 2024

Ashkboos, S., Verhoef, B., Hoefler, T., Eleftheriou, E., and Dazzi, M. Efqat: An efficient framework for quantization-aware training.arXiv preprint arXiv:2411.11038, 2024

work page arXiv 2024
[3]

Low-rank quantization-aware training for llms.arXiv preprint arXiv:2406.06385, 2024

Bondarenko, Y., Del Chiaro, R., and Nagel, M. Low-rank quantization-aware training for llms.arXiv preprint arXiv:2406.06385, 2024

work page arXiv 2024
[4]

L., and Simonyan, K

Brock, A., De, S., Smith, S. L., and Simonyan, K. High-performance large-scale image recognition without normalization. InInternational conference on machine learning, pp. 1059–1071. PMLR, 2021

2021
[5]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901
[6]

L., Panferov, A., Tabesh, S., Sieberling, O., Chen, J., Nikdan, M., Ashkboos, S., and Alistarh, D

Castro, R. L., Panferov, A., Tabesh, S., Sieberling, O., Chen, J., Nikdan, M., Ashkboos, S., and Alistarh, D. Quartet: Native fp4 training can be optimal for large language models.arXiv preprint arXiv:2505.14669, 2025

work page arXiv 2025
[7]

Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024

work page arXiv 2024
[8]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Convergence-aware operator-wise mixed-precision training.CCF Transactions on High Performance Computing, 7(1):43–57, 2025

Dai, W., Jia, Z., Bai, Y., and Sun, Q. Convergence-aware operator-wise mixed-precision training.CCF Transactions on High Performance Computing, 7(1):43–57, 2025

2025
[11]

arXiv preprint arXiv:2110.02861 , year=

Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization.arXiv preprint arXiv:2110.02861, 2021

work page arXiv 2021
[12]

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

2022
[13]

Qlora: Efficient finetuning of quantized llms

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

2023
[14]

M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A

Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. Glam: Efficient scaling of language models with mixture-of-experts. InInternational conference on machine learning, pp. 5547–5569. PMLR, 2022

2022
[15]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

2022
[16]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[18]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[19]

Synergistic intra-and cross-layer regularization losses for moe expert specialization.arXiv preprint arXiv:2602.14159, 2026

Hu, R., Cao, Y., Kong, B., Sun, M., and Yuan, K. Synergistic intra-and cross-layer regularization losses for moe expert specialization.arXiv preprint arXiv:2602.14159, 2026. 13

work page arXiv 2026
[20]

GradientStabilizer:Fix the Norm, Not the Gradient

Huang, T., Hu, H., Zhang, Z., Jin, G., Li, X., Shen, L., Chen, T., Liu, L., Wen, Q., Wang, Z., et al. Stable-spam: How to train in 4-bit more stably than 16-bit adam.arXiv preprint arXiv:2502.17055, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Spam: Spike-aware adam with momentum reset for stable llm training.arXiv preprint arXiv:2501.06842, 2025

Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., and Liu, S. Spam: Spike-aware adam with momentum reset for stable llm training.arXiv preprint arXiv:2501.06842, 2025

work page arXiv 2025
[22]

V., Wu, Y., et al

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

2019
[23]

A., Jordan, M

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts.Neural computation, 3(1):79–87, 1991

1991
[24]

Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training.Advances in Neural Information Processing Systems, 37:8734–8759, 2024

Jia, J., Xie, C., Lu, H., Wang, D., Feng, H., Zhang, C., Sun, B., Lin, H., Zhang, Z., Liu, X., et al. Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training.Advances in Neural Information Processing Systems, 37:8734–8759, 2024

2024
[25]

Accelerating large batch training via gradient signal to noise ratio (gsnr).arXiv preprint arXiv:2309.13681, 2023

Jiang, G., Liu, J., Ding, Z., Guo, L., and Lin, W. Accelerating large batch training via gradient signal to noise ratio (gsnr).arXiv preprint arXiv:2309.13681, 2023

work page arXiv 2023
[26]

Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022

Jiang, Z., Chen, X., Huang, X., Du, X., Zhou, D., and Wang, Z. Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022

2022
[27]

Clapping: Removing per-sample storage for pipeline parallel distributed optimization with communication compression.arXiv preprint arXiv:2509.19029, 2025

Kong, B., Huang, X., Xu, Y., Liang, Y., Wang, B., and Yuan, K. Clapping: Removing per-sample storage for pipeline parallel distributed optimization with communication compression.arXiv preprint arXiv:2509.19029, 2025

work page arXiv 2025
[28]

CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Kong, B., Liang, J., Liu, Y., Deng, R., and Yuan, K. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure.arXiv preprint arXiv:2509.18993, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Adaptive precision training (adapt): A dynamic quantized training approach for dnns

Kummer, L., Sidak, K., Reichmann, T., and Gansterer, W. Adaptive precision training (adapt): A dynamic quantized training approach for dnns. InProceedings of the 2023 SIAM International Conference on Data Mining (SDM), pp. 559–567. SIAM, 2023

2023
[30]

J., and Lee, D

Lee, J., Bae, J., Kim, B., Kwon, S. J., and Lee, D. To fp8 and back again: Quantifying reduced precision effects on llm training stability.arXiv preprint arXiv:2405.18710, 2024

work page arXiv 2024
[31]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

2024
[32]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Llm-qat: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023

work page arXiv 2023
[34]

McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training.arXiv preprint arXiv:1812.06162, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

O., Osei-Kuffuor, D., Schordan, M., Lloyd, S., Mohror, K., and Hittinger, J

Menon, H., Lam, M. O., Osei-Kuffuor, D., Schordan, M., Lloyd, S., Mohror, K., and Hittinger, J. Adapt: Algorithmic differentiation applied to floating-point precision tuning. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 614–626. IEEE, 2018

2018
[36]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[37]

Mixed Precision Training

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

FP8 Formats for Deep Learning

Micikevicius, P., Stosic, D., Burgess, N., Cornea, M., Dubey, P., Grisenthwaite, R., Ha, S., Heinecke, A., Judd, P., Kamalu, J., et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022. 14

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

On the difficulty of training recurrent neural networks

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. InInternational conference on machine learning, pp. 1310–1318. Pmlr, 2013

2013
[40]

Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313, 2023

Peng, H., Wu, K., Wei, Y., Zhao, G., Yang, Y., Liu, Z., Xiong, Y., Yang, Z., Ni, B., Hu, J., et al. Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313, 2023

work page arXiv 2023
[41]

P., Zhang, Y., Briggs, J., Blake, C., Levy-Kramer, J., Balanca, P., Luschi, C., Barlow, S., and Fitzgibbon, A

Perez, S. P., Zhang, Y., Briggs, J., Blake, C., Levy-Kramer, J., Balanca, P., Luschi, C., Barlow, S., and Fitzgibbon, A. W. Training and inference of large language models using 8-bit floating point.arXiv preprint arXiv:2309.17224, 2023

work page arXiv 2023
[42]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

2019
[43]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21 (140):1–67, 2020

2020
[44]

Zero: Memory optimizations toward training trillion parameter models

Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020

2020
[45]

Protocol models: Scaling decentralized training with communication-efficient model parallelism.arXiv preprint arXiv:2506.01260, 2025

Ramasinghe, S., Ajanthan, T., Avraham, G., Zuo, Y., and Long, A. Protocol models: Scaling decentralized training with communication-efficient model parallelism.arXiv preprint arXiv:2506.01260, 2025

work page arXiv 2025
[46]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[48]

A tail-index analysis of stochastic gradient noise in deep neural networks

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pp. 5827–5837. PMLR, 2019

2019
[49]

Spike no more: Stabilizing the pre-training of large language models.arXiv preprint arXiv:2312.16903, 2023

Takase, S., Kiyono, S., Kobayashi, S., and Suzuki, J. Spike no more: Stabilizing the pre-training of large language models.arXiv preprint arXiv:2312.16903, 2023

work page arXiv 2023
[50]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

N., Kaiser, Ł., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[52]

Cambridge university press, 2018

Vershynin, R.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018

2018
[53]

Pipeoffload: Improving scalability of pipeline parallelism with memory optimization.arXiv preprint arXiv:2503.01328, 2025

Wan, X., Qi, P., Huang, G., Lin, M., and Li, J. Pipeoffload: Improving scalability of pipeline parallelism with memory optimization.arXiv preprint arXiv:2503.01328, 2025

work page arXiv 2025
[54]

A., Holmes, C., Rajbhandari, S., Ruwase, O., Yan, F., Yang, L., and He, Y

Wang, G., Qin, H., Jacobs, S. A., Holmes, C., Rajbhandari, S., Ruwase, O., Yan, F., Yang, L., and He, Y. Zero++: Extremely efficient collective communication for giant model training.arXiv preprint arXiv:2306.10209, 2023

work page arXiv 2023
[55]

Optimizing Large Language Model Training Using FP4 Quantization

Wang, R., Gong, Y., Liu, X., Zhao, G., Yang, Z., Guo, B., Zha, Z., and Cheng, P. Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

and Kanwar, P

Wang, S. and Kanwar, P. Bfloat16: The secret to high performance on cloud tpus.URL https://cloud. google. com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus, 2019

2019
[57]

Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.arXiv preprint arXiv:2304.09145, 2023

Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., and Liu, X. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.arXiv preprint arXiv:2304.09145, 2023

work page arXiv 2023
[58]

Coat: Compressing optimizer states and activation for memory-efficient fp8 training.arXiv preprint arXiv:2410.19313, 2024

Xi, H., Cai, H., Zhu, L., Lu, Y., Keutzer, K., Chen, J., and Han, S. Coat: Compressing optimizer states and activation for memory-efficient fp8 training.arXiv preprint arXiv:2410.19313, 2024. 15

work page arXiv 2024
[59]

Smoothquant: Accurate and efficient post-training quantization for large language models

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pp. 38087–38099. PMLR, 2023

2023
[60]

On layer normalization in the transformer architecture

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. InInternational conference on machine learning, pp. 10524–10533. PMLR, 2020

2020
[61]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35: 27168–27183, 2022

Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., and He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35: 27168–27183, 2022

2022
[62]

Ldp: Learnable dynamic precision for efficient deep neural network training and inference.arXiv preprint arXiv:2203.07713, 2022

Yu, Z., Fu, Y., Wu, S., Li, M., You, H., and Lin, Y. Ldp: Learnable dynamic precision for efficient deep neural network training and inference.arXiv preprint arXiv:2203.07713, 2022

work page arXiv 2022
[63]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

2019
[64]

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. Galore: Memory-efficient llm training by gradient low-rank projection.arXiv preprint arXiv:2403.03507, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023. 16 Appendix A Related works This section reviews work on low-precision training and adaptation, with emphasis on how prior methods rel...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149,

Abecassis, F., Agrusa, A., Ahn, D., Alben, J., Alborghetti, S., Andersch, M., Arayandi, S., Bjorlin, A., Blakeman, A., Briones, E., et al. Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

work page arXiv 2025

[2] [2]

Efqat: An efficient framework for quantization-aware training.arXiv preprint arXiv:2411.11038, 2024

Ashkboos, S., Verhoef, B., Hoefler, T., Eleftheriou, E., and Dazzi, M. Efqat: An efficient framework for quantization-aware training.arXiv preprint arXiv:2411.11038, 2024

work page arXiv 2024

[3] [3]

Low-rank quantization-aware training for llms.arXiv preprint arXiv:2406.06385, 2024

Bondarenko, Y., Del Chiaro, R., and Nagel, M. Low-rank quantization-aware training for llms.arXiv preprint arXiv:2406.06385, 2024

work page arXiv 2024

[4] [4]

L., and Simonyan, K

Brock, A., De, S., Smith, S. L., and Simonyan, K. High-performance large-scale image recognition without normalization. InInternational conference on machine learning, pp. 1059–1071. PMLR, 2021

2021

[5] [5]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901

[6] [6]

L., Panferov, A., Tabesh, S., Sieberling, O., Chen, J., Nikdan, M., Ashkboos, S., and Alistarh, D

Castro, R. L., Panferov, A., Tabesh, S., Sieberling, O., Chen, J., Nikdan, M., Ashkboos, S., and Alistarh, D. Quartet: Native fp4 training can be optimal for large language models.arXiv preprint arXiv:2505.14669, 2025

work page arXiv 2025

[7] [7]

Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024

work page arXiv 2024

[8] [8]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Convergence-aware operator-wise mixed-precision training.CCF Transactions on High Performance Computing, 7(1):43–57, 2025

Dai, W., Jia, Z., Bai, Y., and Sun, Q. Convergence-aware operator-wise mixed-precision training.CCF Transactions on High Performance Computing, 7(1):43–57, 2025

2025

[11] [11]

arXiv preprint arXiv:2110.02861 , year=

Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization.arXiv preprint arXiv:2110.02861, 2021

work page arXiv 2021

[12] [12]

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

2022

[13] [13]

Qlora: Efficient finetuning of quantized llms

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

2023

[14] [14]

M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A

Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. Glam: Efficient scaling of language models with mixture-of-experts. InInternational conference on machine learning, pp. 5547–5569. PMLR, 2022

2022

[15] [15]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

2022

[16] [16]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[18] [18]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022

[19] [19]

Synergistic intra-and cross-layer regularization losses for moe expert specialization.arXiv preprint arXiv:2602.14159, 2026

Hu, R., Cao, Y., Kong, B., Sun, M., and Yuan, K. Synergistic intra-and cross-layer regularization losses for moe expert specialization.arXiv preprint arXiv:2602.14159, 2026. 13

work page arXiv 2026

[20] [20]

GradientStabilizer:Fix the Norm, Not the Gradient

Huang, T., Hu, H., Zhang, Z., Jin, G., Li, X., Shen, L., Chen, T., Liu, L., Wen, Q., Wang, Z., et al. Stable-spam: How to train in 4-bit more stably than 16-bit adam.arXiv preprint arXiv:2502.17055, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Spam: Spike-aware adam with momentum reset for stable llm training.arXiv preprint arXiv:2501.06842, 2025

Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., and Liu, S. Spam: Spike-aware adam with momentum reset for stable llm training.arXiv preprint arXiv:2501.06842, 2025

work page arXiv 2025

[22] [22]

V., Wu, Y., et al

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

2019

[23] [23]

A., Jordan, M

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts.Neural computation, 3(1):79–87, 1991

1991

[24] [24]

Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training.Advances in Neural Information Processing Systems, 37:8734–8759, 2024

Jia, J., Xie, C., Lu, H., Wang, D., Feng, H., Zhang, C., Sun, B., Lin, H., Zhang, Z., Liu, X., et al. Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training.Advances in Neural Information Processing Systems, 37:8734–8759, 2024

2024

[25] [25]

Accelerating large batch training via gradient signal to noise ratio (gsnr).arXiv preprint arXiv:2309.13681, 2023

Jiang, G., Liu, J., Ding, Z., Guo, L., and Lin, W. Accelerating large batch training via gradient signal to noise ratio (gsnr).arXiv preprint arXiv:2309.13681, 2023

work page arXiv 2023

[26] [26]

Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022

Jiang, Z., Chen, X., Huang, X., Du, X., Zhou, D., and Wang, Z. Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022

2022

[27] [27]

Clapping: Removing per-sample storage for pipeline parallel distributed optimization with communication compression.arXiv preprint arXiv:2509.19029, 2025

Kong, B., Huang, X., Xu, Y., Liang, Y., Wang, B., and Yuan, K. Clapping: Removing per-sample storage for pipeline parallel distributed optimization with communication compression.arXiv preprint arXiv:2509.19029, 2025

work page arXiv 2025

[28] [28]

CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Kong, B., Liang, J., Liu, Y., Deng, R., and Yuan, K. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure.arXiv preprint arXiv:2509.18993, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Adaptive precision training (adapt): A dynamic quantized training approach for dnns

Kummer, L., Sidak, K., Reichmann, T., and Gansterer, W. Adaptive precision training (adapt): A dynamic quantized training approach for dnns. InProceedings of the 2023 SIAM International Conference on Data Mining (SDM), pp. 559–567. SIAM, 2023

2023

[30] [30]

J., and Lee, D

Lee, J., Bae, J., Kim, B., Kwon, S. J., and Lee, D. To fp8 and back again: Quantifying reduced precision effects on llm training stability.arXiv preprint arXiv:2405.18710, 2024

work page arXiv 2024

[31] [31]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

2024

[32] [32]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Llm-qat: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023

work page arXiv 2023

[34] [34]

McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training.arXiv preprint arXiv:1812.06162, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[35] [35]

O., Osei-Kuffuor, D., Schordan, M., Lloyd, S., Mohror, K., and Hittinger, J

Menon, H., Lam, M. O., Osei-Kuffuor, D., Schordan, M., Lloyd, S., Mohror, K., and Hittinger, J. Adapt: Algorithmic differentiation applied to floating-point precision tuning. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 614–626. IEEE, 2018

2018

[36] [36]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[37] [37]

Mixed Precision Training

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[38] [38]

FP8 Formats for Deep Learning

Micikevicius, P., Stosic, D., Burgess, N., Cornea, M., Dubey, P., Grisenthwaite, R., Ha, S., Heinecke, A., Judd, P., Kamalu, J., et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022. 14

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

On the difficulty of training recurrent neural networks

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. InInternational conference on machine learning, pp. 1310–1318. Pmlr, 2013

2013

[40] [40]

Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313, 2023

Peng, H., Wu, K., Wei, Y., Zhao, G., Yang, Y., Liu, Z., Xiong, Y., Yang, Z., Ni, B., Hu, J., et al. Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313, 2023

work page arXiv 2023

[41] [41]

P., Zhang, Y., Briggs, J., Blake, C., Levy-Kramer, J., Balanca, P., Luschi, C., Barlow, S., and Fitzgibbon, A

Perez, S. P., Zhang, Y., Briggs, J., Blake, C., Levy-Kramer, J., Balanca, P., Luschi, C., Barlow, S., and Fitzgibbon, A. W. Training and inference of large language models using 8-bit floating point.arXiv preprint arXiv:2309.17224, 2023

work page arXiv 2023

[42] [42]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

2019

[43] [43]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21 (140):1–67, 2020

2020

[44] [44]

Zero: Memory optimizations toward training trillion parameter models

Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020

2020

[45] [45]

Protocol models: Scaling decentralized training with communication-efficient model parallelism.arXiv preprint arXiv:2506.01260, 2025

Ramasinghe, S., Ajanthan, T., Avraham, G., Zuo, Y., and Long, A. Protocol models: Scaling decentralized training with communication-efficient model parallelism.arXiv preprint arXiv:2506.01260, 2025

work page arXiv 2025

[46] [46]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[47] [47]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[48] [48]

A tail-index analysis of stochastic gradient noise in deep neural networks

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pp. 5827–5837. PMLR, 2019

2019

[49] [49]

Spike no more: Stabilizing the pre-training of large language models.arXiv preprint arXiv:2312.16903, 2023

Takase, S., Kiyono, S., Kobayashi, S., and Suzuki, J. Spike no more: Stabilizing the pre-training of large language models.arXiv preprint arXiv:2312.16903, 2023

work page arXiv 2023

[50] [50]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

N., Kaiser, Ł., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017

[52] [52]

Cambridge university press, 2018

Vershynin, R.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018

2018

[53] [53]

Pipeoffload: Improving scalability of pipeline parallelism with memory optimization.arXiv preprint arXiv:2503.01328, 2025

Wan, X., Qi, P., Huang, G., Lin, M., and Li, J. Pipeoffload: Improving scalability of pipeline parallelism with memory optimization.arXiv preprint arXiv:2503.01328, 2025

work page arXiv 2025

[54] [54]

A., Holmes, C., Rajbhandari, S., Ruwase, O., Yan, F., Yang, L., and He, Y

Wang, G., Qin, H., Jacobs, S. A., Holmes, C., Rajbhandari, S., Ruwase, O., Yan, F., Yang, L., and He, Y. Zero++: Extremely efficient collective communication for giant model training.arXiv preprint arXiv:2306.10209, 2023

work page arXiv 2023

[55] [55]

Optimizing Large Language Model Training Using FP4 Quantization

Wang, R., Gong, Y., Liu, X., Zhao, G., Yang, Z., Guo, B., Zha, Z., and Cheng, P. Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

and Kanwar, P

Wang, S. and Kanwar, P. Bfloat16: The secret to high performance on cloud tpus.URL https://cloud. google. com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus, 2019

2019

[57] [57]

Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.arXiv preprint arXiv:2304.09145, 2023

Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., and Liu, X. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.arXiv preprint arXiv:2304.09145, 2023

work page arXiv 2023

[58] [58]

Coat: Compressing optimizer states and activation for memory-efficient fp8 training.arXiv preprint arXiv:2410.19313, 2024

Xi, H., Cai, H., Zhu, L., Lu, Y., Keutzer, K., Chen, J., and Han, S. Coat: Compressing optimizer states and activation for memory-efficient fp8 training.arXiv preprint arXiv:2410.19313, 2024. 15

work page arXiv 2024

[59] [59]

Smoothquant: Accurate and efficient post-training quantization for large language models

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pp. 38087–38099. PMLR, 2023

2023

[60] [60]

On layer normalization in the transformer architecture

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. InInternational conference on machine learning, pp. 10524–10533. PMLR, 2020

2020

[61] [61]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35: 27168–27183, 2022

Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., and He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35: 27168–27183, 2022

2022

[62] [62]

Ldp: Learnable dynamic precision for efficient deep neural network training and inference.arXiv preprint arXiv:2203.07713, 2022

Yu, Z., Fu, Y., Wu, S., Li, M., You, H., and Lin, Y. Ldp: Learnable dynamic precision for efficient deep neural network training and inference.arXiv preprint arXiv:2203.07713, 2022

work page arXiv 2022

[63] [63]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

2019

[64] [64]

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. Galore: Memory-efficient llm training by gradient low-rank projection.arXiv preprint arXiv:2403.03507, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023. 16 Appendix A Related works This section reviews work on low-precision training and adaptation, with emphasis on how prior methods rel...

work page internal anchor Pith review Pith/arXiv arXiv 2023