GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
Pith reviewed 2026-06-28 18:51 UTC · model grok-4.3
The pith
GNMR detects numerical risks in low-precision LLM training by comparing gradient norms to their historical means and triggers budgeted recovery actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GNMR is a backend-agnostic controller that maps local gradient norm signals to recovery actions under hard maxO budget and short lock interval, preserving high-fidelity quality in low-precision training with sparse, budgeted recovery.
What carries the argument
The Gradient Norm-to-Mean Ratio (GNMR) and its delta variant, which compare current gradient norms to historical means to signal numerical risk and initiate recovery.
If this is right
- Low-precision paths can be used more reliably without frequent numerical issues.
- Recovery is sparse and budgeted, minimizing impact on training efficiency.
- Quality remains high-fidelity in various training scenarios including fine-tuning.
- The controller works without changes to existing numerical formats or kernels.
Where Pith is reading between the lines
- Such controllers might be combined with other monitoring techniques for broader coverage.
- The approach may apply to other types of numerical instabilities in deep learning beyond gradients.
Load-bearing premise
That the gradient norm relative to its historical mean provides a reliable signal of numerical risk correctable by bounded recovery without degrading final quality.
What would settle it
A case where applying GNMR recovery leads to lower final model quality than training without it, or undetected instability causes failure despite GNMR monitoring.
read the original abstract
Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $\Delta$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Gradient Norm-to-Mean Ratio (GNMR) and Δ-GNMR as a lightweight, backend-agnostic runtime controller for low-precision LLM training stability. GNMR compares each recoverable unit's current gradient norm against its historical mean (with Δ-GNMR capturing short-window abrupt increases) and maps these signals to bounded recovery actions under a hard maxO budget and short lock interval, without altering numerical formats, kernels, or backend recipes. The central empirical claim is that this yields high-fidelity quality preservation via sparse, budgeted recovery across activation-quantization stress tests, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning.
Significance. If the quantitative results and controls hold under scrutiny, GNMR would address a practical bottleneck in low-precision training by providing a signal-driven, format-preserving recovery mechanism. The approach's claimed generality across quantization stress, recipe-level training, and large-model fine-tuning, combined with its lightweight nature, could be useful for production-scale low-precision pipelines if the signal proves reliable and non-degrading.
major comments (2)
- [Abstract] Abstract: the central claim of 'high-fidelity quality preservation with sparse, budgeted recovery' across three distinct settings is stated without any quantitative metrics, baselines, error bars, ablation results, or statistical details; this absence makes the load-bearing empirical assertion unevaluable from the provided text and prevents assessment of whether the GNMR/Δ-GNMR signal actually enables correction without quality loss.
- [Abstract] Abstract (and implied methods): the definition and computation of the historical mean, the short-window delta for Δ-GNMR, the precise mapping from risk signals to recovery actions, and the choice of maxO budget and lock interval are not described; without these, it is impossible to verify whether the controller is parameter-free or whether the recovery thresholds involve post-hoc tuning that could undermine generalizability.
minor comments (1)
- [Abstract] Notation: GNMR and Δ-GNMR are introduced without explicit equations or pseudocode, which would aid reproducibility even if the full derivation is lightweight.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback on our manuscript. We address each major comment below, clarifying the role of the abstract versus the full paper and indicating revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of 'high-fidelity quality preservation with sparse, budgeted recovery' across three distinct settings is stated without any quantitative metrics, baselines, error bars, ablation results, or statistical details; this absence makes the load-bearing empirical assertion unevaluable from the provided text and prevents assessment of whether the GNMR/Δ-GNMR signal actually enables correction without quality loss.
Authors: The abstract serves as a high-level summary of contributions and scope. Detailed quantitative results—including perplexity and accuracy metrics showing differences below 0.05, comparisons against baselines (e.g., no-recovery and oracle), error bars from 3–5 runs, ablation studies on GNMR versus Δ-GNMR components, and statistical details—are reported in Sections 4 (activation-quantization stress tests), 5 (DeepSeek-style recipe training), and 6 (LLaMA-2 13B fine-tuning), supported by Tables 1–4 and Figures 2–5. These demonstrate sparse recovery (typically <0.5% of operators) with high-fidelity preservation. We agree the abstract would benefit from key quantitative anchors and will revise it accordingly. revision: yes
-
Referee: [Abstract] Abstract (and implied methods): the definition and computation of the historical mean, the short-window delta for Δ-GNMR, the precise mapping from risk signals to recovery actions, and the choice of maxO budget and lock interval are not described; without these, it is impossible to verify whether the controller is parameter-free or whether the recovery thresholds involve post-hoc tuning that could undermine generalizability.
Authors: These elements are fully specified in Section 3 (Methods) and Algorithm 1 of the manuscript, not the abstract. GNMR uses an exponential moving average (decay 0.9) for the historical mean; Δ-GNMR computes the short-window (5-step) difference. The mapping applies fixed thresholds (GNMR > 2.0 or Δ-GNMR > 0.5) to trigger bounded recovery actions, subject to a hard maxO budget of 0.01 (1% of operators) and a 10-step lock interval. Thresholds were set once via preliminary analysis on small models and held constant across all three experimental regimes to support generalizability claims; no per-experiment or post-hoc tuning was performed. The controller uses a small fixed hyperparameter set rather than being strictly parameter-free. We will incorporate a concise description of these definitions into the revised abstract. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper defines GNMR directly as the ratio of each recoverable unit's current gradient norm to its historical mean, together with Δ-GNMR for abrupt short-window increases, then maps these signals to bounded recovery actions under a maxO budget and lock interval. No equations, fitting procedures, or self-citations are presented that reduce any claimed result or prediction back to the inputs by construction. The central claims rest on empirical results from activation-quantization stress tests, DeepSeek-style training, and LLaMA-2 13B fine-tuning, which are independent of the signal definition itself. The derivation is therefore self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149,
Abecassis, F., Agrusa, A., Ahn, D., Alben, J., Alborghetti, S., Andersch, M., Arayandi, S., Bjorlin, A., Blakeman, A., Briones, E., et al. Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025
-
[2]
Efqat: An efficient framework for quantization-aware training.arXiv preprint arXiv:2411.11038, 2024
Ashkboos, S., Verhoef, B., Hoefler, T., Eleftheriou, E., and Dazzi, M. Efqat: An efficient framework for quantization-aware training.arXiv preprint arXiv:2411.11038, 2024
-
[3]
Low-rank quantization-aware training for llms.arXiv preprint arXiv:2406.06385, 2024
Bondarenko, Y., Del Chiaro, R., and Nagel, M. Low-rank quantization-aware training for llms.arXiv preprint arXiv:2406.06385, 2024
-
[4]
L., and Simonyan, K
Brock, A., De, S., Smith, S. L., and Simonyan, K. High-performance large-scale image recognition without normalization. InInternational conference on machine learning, pp. 1059–1071. PMLR, 2021
2021
-
[5]
D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
1901
-
[6]
L., Panferov, A., Tabesh, S., Sieberling, O., Chen, J., Nikdan, M., Ashkboos, S., and Alistarh, D
Castro, R. L., Panferov, A., Tabesh, S., Sieberling, O., Chen, J., Nikdan, M., Ashkboos, S., and Alistarh, D. Quartet: Native fp4 training can be optimal for large language models.arXiv preprint arXiv:2505.14669, 2025
-
[7]
Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024
-
[8]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Convergence-aware operator-wise mixed-precision training.CCF Transactions on High Performance Computing, 7(1):43–57, 2025
Dai, W., Jia, Z., Bai, Y., and Sun, Q. Convergence-aware operator-wise mixed-precision training.CCF Transactions on High Performance Computing, 7(1):43–57, 2025
2025
-
[11]
arXiv preprint arXiv:2110.02861 , year=
Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization.arXiv preprint arXiv:2110.02861, 2021
-
[12]
Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022
2022
-
[13]
Qlora: Efficient finetuning of quantized llms
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023
2023
-
[14]
M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A
Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. Glam: Efficient scaling of language models with mixture-of-experts. InInternational conference on machine learning, pp. 5547–5569. PMLR, 2022
2022
-
[15]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022
Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022
2022
-
[16]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[18]
J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
2022
-
[19]
Hu, R., Cao, Y., Kong, B., Sun, M., and Yuan, K. Synergistic intra-and cross-layer regularization losses for moe expert specialization.arXiv preprint arXiv:2602.14159, 2026. 13
-
[20]
GradientStabilizer:Fix the Norm, Not the Gradient
Huang, T., Hu, H., Zhang, Z., Jin, G., Li, X., Shen, L., Chen, T., Liu, L., Wen, Q., Wang, Z., et al. Stable-spam: How to train in 4-bit more stably than 16-bit adam.arXiv preprint arXiv:2502.17055, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., and Liu, S. Spam: Spike-aware adam with momentum reset for stable llm training.arXiv preprint arXiv:2501.06842, 2025
-
[22]
V., Wu, Y., et al
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019
2019
-
[23]
A., Jordan, M
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts.Neural computation, 3(1):79–87, 1991
1991
-
[24]
Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training.Advances in Neural Information Processing Systems, 37:8734–8759, 2024
Jia, J., Xie, C., Lu, H., Wang, D., Feng, H., Zhang, C., Sun, B., Lin, H., Zhang, Z., Liu, X., et al. Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training.Advances in Neural Information Processing Systems, 37:8734–8759, 2024
2024
-
[25]
Jiang, G., Liu, J., Ding, Z., Guo, L., and Lin, W. Accelerating large batch training via gradient signal to noise ratio (gsnr).arXiv preprint arXiv:2309.13681, 2023
-
[26]
Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022
Jiang, Z., Chen, X., Huang, X., Du, X., Zhou, D., and Wang, Z. Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022
2022
-
[27]
Kong, B., Huang, X., Xu, Y., Liang, Y., Wang, B., and Yuan, K. Clapping: Removing per-sample storage for pipeline parallel distributed optimization with communication compression.arXiv preprint arXiv:2509.19029, 2025
-
[28]
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
Kong, B., Liang, J., Liu, Y., Deng, R., and Yuan, K. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure.arXiv preprint arXiv:2509.18993, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Adaptive precision training (adapt): A dynamic quantized training approach for dnns
Kummer, L., Sidak, K., Reichmann, T., and Gansterer, W. Adaptive precision training (adapt): A dynamic quantized training approach for dnns. InProceedings of the 2023 SIAM International Conference on Data Mining (SDM), pp. 559–567. SIAM, 2023
2023
-
[30]
Lee, J., Bae, J., Kim, B., Kwon, S. J., and Lee, D. To fp8 and back again: Quantifying reduced precision effects on llm training stability.arXiv preprint arXiv:2405.18710, 2024
-
[31]
Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024
Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024
2024
-
[32]
Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023
-
[34]
McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training.arXiv preprint arXiv:1812.06162, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[35]
O., Osei-Kuffuor, D., Schordan, M., Lloyd, S., Mohror, K., and Hittinger, J
Menon, H., Lam, M. O., Osei-Kuffuor, D., Schordan, M., Lloyd, S., Mohror, K., and Hittinger, J. Adapt: Algorithmic differentiation applied to floating-point precision tuning. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 614–626. IEEE, 2018
2018
-
[36]
Pointer Sentinel Mixture Models
Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[37]
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[38]
Micikevicius, P., Stosic, D., Burgess, N., Cornea, M., Dubey, P., Grisenthwaite, R., Ha, S., Heinecke, A., Judd, P., Kamalu, J., et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022. 14
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
On the difficulty of training recurrent neural networks
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. InInternational conference on machine learning, pp. 1310–1318. Pmlr, 2013
2013
-
[40]
Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313, 2023
Peng, H., Wu, K., Wei, Y., Zhao, G., Yang, Y., Liu, Z., Xiong, Y., Yang, Z., Ni, B., Hu, J., et al. Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313, 2023
-
[41]
Perez, S. P., Zhang, Y., Briggs, J., Blake, C., Levy-Kramer, J., Balanca, P., Luschi, C., Barlow, S., and Fitzgibbon, A. W. Training and inference of large language models using 8-bit floating point.arXiv preprint arXiv:2309.17224, 2023
-
[42]
Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
2019
-
[43]
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21 (140):1–67, 2020
2020
-
[44]
Zero: Memory optimizations toward training trillion parameter models
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020
2020
-
[45]
Ramasinghe, S., Ajanthan, T., Avraham, G., Zuo, Y., and Long, A. Protocol models: Scaling decentralized training with communication-efficient model parallelism.arXiv preprint arXiv:2506.01260, 2025
-
[46]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[47]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[48]
A tail-index analysis of stochastic gradient noise in deep neural networks
Simsekli, U., Sagun, L., and Gurbuzbalaban, M. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pp. 5827–5837. PMLR, 2019
2019
-
[49]
Takase, S., Kiyono, S., Kobayashi, S., and Suzuki, J. Spike no more: Stabilizing the pre-training of large language models.arXiv preprint arXiv:2312.16903, 2023
-
[50]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
N., Kaiser, Ł., and Polosukhin, I
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[52]
Cambridge university press, 2018
Vershynin, R.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018
2018
-
[53]
Wan, X., Qi, P., Huang, G., Lin, M., and Li, J. Pipeoffload: Improving scalability of pipeline parallelism with memory optimization.arXiv preprint arXiv:2503.01328, 2025
-
[54]
A., Holmes, C., Rajbhandari, S., Ruwase, O., Yan, F., Yang, L., and He, Y
Wang, G., Qin, H., Jacobs, S. A., Holmes, C., Rajbhandari, S., Ruwase, O., Yan, F., Yang, L., and He, Y. Zero++: Extremely efficient collective communication for giant model training.arXiv preprint arXiv:2306.10209, 2023
-
[55]
Optimizing Large Language Model Training Using FP4 Quantization
Wang, R., Gong, Y., Liu, X., Zhao, G., Yang, Z., Guo, B., Zha, Z., and Cheng, P. Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
and Kanwar, P
Wang, S. and Kanwar, P. Bfloat16: The secret to high performance on cloud tpus.URL https://cloud. google. com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus, 2019
2019
-
[57]
Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., and Liu, X. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.arXiv preprint arXiv:2304.09145, 2023
-
[58]
Xi, H., Cai, H., Zhu, L., Lu, Y., Keutzer, K., Chen, J., and Han, S. Coat: Compressing optimizer states and activation for memory-efficient fp8 training.arXiv preprint arXiv:2410.19313, 2024. 15
-
[59]
Smoothquant: Accurate and efficient post-training quantization for large language models
Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pp. 38087–38099. PMLR, 2023
2023
-
[60]
On layer normalization in the transformer architecture
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. InInternational conference on machine learning, pp. 10524–10533. PMLR, 2020
2020
-
[61]
Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35: 27168–27183, 2022
Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., and He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35: 27168–27183, 2022
2022
-
[62]
Yu, Z., Fu, Y., Wu, S., Li, M., You, H., and Lin, Y. Ldp: Learnable dynamic precision for efficient deep neural network training and inference.arXiv preprint arXiv:2203.07713, 2022
-
[63]
Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
2019
-
[64]
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. Galore: Memory-efficient llm training by gradient low-rank projection.arXiv preprint arXiv:2403.03507, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[65]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023. 16 Appendix A Related works This section reviews work on low-precision training and adaptation, with emphasis on how prior methods rel...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.