
arXiv: 2305.10947 · v7 · submitted 2023-05-18 · cs.LG · cs.AI · cs.CV · cs.PF

Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning

Pith reviewed 2026-05-24 08:24 UTC · model grok-4.3

keywords: 16-bit precision, neural network training, mixed precision, floating-point errors, classification tolerance, resource-limited learning, computational efficiency

The pith

Standalone 16-bit neural networks match 32-bit accuracy while running faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that training neural networks using only 16-bit precision throughout can produce accuracy equal to 32-bit or mixed-precision training. It backs the claim with a theoretical analysis of floating-point errors and classification tolerance plus extensive experiments. This would matter to practitioners who lack hardware for lower formats like FP8 and must choose between 32-bit, 16-bit, or mixtures. The work is presented as the first systematic validation of the widespread assumption that 16-bit suffices on its own.

Core claim

The paper claims that standalone 16-bit precision neural networks match 32-bit and mixed-precision in accuracy while boosting computational speed. This is shown through a theoretical formalization of floating-point errors and classification tolerance that explains when 16-bit can approximate 32-bit results, backed by extensive empirical evaluation.

What carries the argument

Theoretical formalization of floating-point errors and classification tolerance that identifies conditions for 16-bit to approximate 32-bit training outcomes.
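
One way to picture the link between rounding error and classification tolerance is the sketch below. It is an illustration under stated assumptions, not the paper's formalization: the page does not give the paper's definition of classification tolerance, so the sketch reads it as the gap between the two largest logits. The helper names (`to_fp16`, `top2_margin`, `argmax`) and the logit values are hypothetical. If every logit moves by at most `max_err` when rounded to 16 bits, the gap between any two logits moves by at most `2 * max_err`, so the predicted class cannot change while the gap exceeds that amount.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def top2_margin(logits):
    """Gap between the largest and second-largest logit."""
    ordered = sorted(logits, reverse=True)
    return ordered[0] - ordered[1]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits standing in for a higher-precision model output.
logits = [2.3712, 0.9184, -1.4021]
logits_fp16 = [to_fp16(v) for v in logits]

# Largest per-logit rounding error introduced by the 16-bit format.
max_err = max(abs(a - b) for a, b in zip(logits, logits_fp16))
margin = top2_margin(logits)

# Sufficient condition for an unchanged prediction (assumed reading of
# "classification tolerance", not the paper's stated definition).
tolerance_holds = 2 * max_err < margin
prediction_preserved = argmax(logits) == argmax(logits_fp16)
```

The condition is sufficient but not necessary: a prediction can survive rounding even when the margin is smaller than twice the error.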

If this is right

  • 16-bit standalone training becomes a viable option for resource-limited practitioners without accuracy loss.
  • Training speed increases due to lower precision computations across available GPUs.
  • Practitioners can select precision based on hardware access rather than expected accuracy differences.
  • The approach applies to a range of models because the error analysis is not tied to specific architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If confirmed, frameworks could default more training routines to 16-bit to cut memory use in small-scale or educational settings.
  • The result might reduce reliance on mixed-precision libraries when only basic hardware is present.
  • It opens questions about whether the same tolerance holds when 16-bit is combined with other efficiency methods such as pruning.

Load-bearing premise

The theoretical formalization of floating-point errors and classification tolerance accurately captures the conditions under which 16-bit precision approximates 32-bit results in actual neural network training dynamics.

What would settle it

A controlled experiment on a standard benchmark where standalone 16-bit training produces clearly lower accuracy than 32-bit training under matched conditions would disprove the central claim.
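
A matched-conditions comparison of this kind could be prototyped along the following lines. This is a toy sketch, not the paper's protocol: it trains a two-feature logistic regression on synthetic data and emulates each precision by rounding after every arithmetic step with the standard-library `struct` module (`'e'` for binary16, `'f'` for binary32). The function names, data distribution, learning rate, and epoch count are all hypothetical choices made for illustration. The only thing that differs between the two runs is the rounding function.

```python
import math
import random
import struct


def rounder(fmt):
    """Return a function rounding a float to a struct format
    ('e' = IEEE binary16, 'f' = IEEE binary32)."""
    def r(x):
        return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]
    return r


def make_data(seed, n_per_class=100):
    """Two Gaussian blobs centred at (-2, -2) and (2, 2)."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
            data.append((point, label))
    return data


def train_and_score(train, test, r, lr=0.1, epochs=50):
    """Full-batch gradient descent with rounding r after each operation;
    returns accuracy on the held-out set."""
    w = [r(0.0), r(0.0)]
    b = r(0.0)
    n = len(train)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for (x1, x2), y in train:
            z = r(r(w[0] * r(x1)) + r(w[1] * r(x2)) + b)
            p = r(1.0 / (1.0 + math.exp(-z)))  # sigmoid, result rounded
            err = r(p - y)
            gw[0] = r(gw[0] + r(err * r(x1)))
            gw[1] = r(gw[1] + r(err * r(x2)))
            gb = r(gb + err)
        w = [r(w[i] - r(lr * r(gw[i] / n))) for i in range(2)]
        b = r(b - r(lr * r(gb / n)))
    correct = sum(
        int((w[0] * x1 + w[1] * x2 + b > 0) == (y == 1))
        for (x1, x2), y in test
    )
    return correct / len(test)


train, test = make_data(seed=0), make_data(seed=1)
acc32 = train_and_score(train, test, rounder("f"))  # emulated 32-bit run
acc16 = train_and_score(train, test, rounder("e"))  # emulated 16-bit run
```

A problem this easy cannot by itself confirm or refute the claim. The point of the sketch is the structure: identical data, seed, and hyperparameters, with precision as the single varied factor.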

Figures

Figures reproduced from arXiv: 2305.10947 by Byungkon Kang, Francois Rameau, Juyoung Yun, Sol Choi, Zhoulai Fu.

Figure 1: Comparison of 16-bit floating-point neural networks with other precisions across different ...
Figure 3: DNN Accuracies on MNIST Dataset: 32-bit vs. 16-bit floating-point
Figure 4: Comparative test accuracy over 100 epochs on CNNs and Vision Transformer (ViT)
Figure 5: Boxplot of Test Accuracy: This figure illustrates the performance of CNN models and the ...

Original abstract

With the increasing complexity of machine learning models, managing computational resources like memory and processing power has become a critical concern. Mixed precision techniques, which leverage different numerical precisions during model training and inference to optimize resource usage, have been widely adopted. However, access to hardware that supports lower precision formats (e.g., FP8 or FP4) remains limited, especially for practitioners with hardware constraints. For many with limited resources, the available options are restricted to using 32-bit, 16-bit, or a combination of the two. While it is commonly believed that 16-bit precision can achieve results comparable to full (32-bit) precision, this study is the first to systematically validate this assumption through both rigorous theoretical analysis and extensive empirical evaluation. Our theoretical formalization of floating-point errors and classification tolerance provides new insights into the conditions under which 16-bit precision can approximate 32-bit results. This study fills a critical gap, proving for the first time that standalone 16-bit precision neural networks match 32-bit and mixed-precision in accuracy while boosting computational speed. Given the widespread availability of 16-bit across GPUs, these findings are especially valuable for machine learning practitioners with limited hardware resources to make informed decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that standalone FP16 neural network training achieves accuracy matching FP32 and mixed-precision training, supported by a theoretical formalization of per-step floating-point rounding errors linked to classification tolerance, plus extensive empirical evaluations across models and datasets. It positions this as the first rigorous validation of the assumption that 16-bit precision suffices for resource-limited settings, with benefits in speed and memory.

Significance. If the central claim holds, the work would offer practical value for practitioners without access to specialized low-precision hardware, confirming that FP16 can be used standalone without accuracy degradation. The empirical component appears extensive, but the theoretical contribution is limited by its scope.

major comments (1)
  1. [Theoretical Analysis] Theoretical section: the analysis derives bounds on rounding error for a single forward/backward pass and connects them to classification tolerance, but provides no inductive argument, Lyapunov-style bound, or analysis of error accumulation over the full training trajectory (thousands of optimizer steps in non-convex landscapes). This is load-bearing for the claim that final accuracy remains unaffected.
minor comments (2)
  1. [Abstract] Abstract and introduction: the claim of being 'the first to systematically validate' should be supported by a more explicit comparison to prior mixed-precision and low-precision training literature.
  2. [Theoretical Analysis] Notation: clarify whether the classification tolerance parameter is derived from data statistics or treated as a hyperparameter, as this affects the generality of the theoretical result.
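
The major comment's concern, that a per-step bound does not by itself control error over a whole trajectory, can be illustrated with a small sketch. It is a toy example and says nothing about the paper's experiments: the update size, step count, and the helper `to_fp16` are hypothetical. Each 16-bit addition below is correctly rounded, so the error per step is at most half a unit in the last place, yet the update is smaller than that half-unit near 1.0 and is absorbed every time.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


STEPS = 1000     # hypothetical number of optimizer steps
UPDATE = 1e-4    # hypothetical per-step weight update

w_fp16 = to_fp16(1.0)  # weight stored in emulated 16-bit precision
w_ref = 1.0            # same weight tracked in Python's 64-bit float

for _ in range(STEPS):
    # Near 1.0 the binary16 spacing is 2**-10 (about 9.8e-4), so adding
    # 1e-4 rounds back to the same value on every step.
    w_fp16 = to_fp16(w_fp16 + to_fp16(UPDATE))
    w_ref = w_ref + UPDATE

# Accumulated gap between the two trajectories after all steps.
drift = w_ref - w_fp16
```

After the loop the 16-bit weight is still exactly 1.0 while the reference has moved to about 1.1. Whether such absorption matters in practice depends on update magnitudes relative to weight magnitudes, which is what an accumulation analysis would need to address.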

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below, providing an honest assessment of the theoretical scope while highlighting the supporting empirical evidence.

Point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical section: the analysis derives bounds on rounding error for a single forward/backward pass and connects them to classification tolerance, but provides no inductive argument, Lyapunov-style bound, or analysis of error accumulation over the full training trajectory (thousands of optimizer steps in non-convex landscapes). This is load-bearing for the claim that final accuracy remains unaffected.

    Authors: We agree that the theoretical analysis is limited to deriving per-step bounds on floating-point rounding errors and linking them to classification tolerance, without an inductive argument, Lyapunov-style stability bound, or explicit analysis of error accumulation across the full non-convex training trajectory. This is a genuine limitation of the current theoretical contribution, as a complete characterization of long-term error propagation remains an open challenge in optimization theory. Our manuscript positions the per-step formalization as providing new insights into when 16-bit precision can approximate 32-bit results, with the primary validation coming from the extensive empirical evaluations across models and datasets. We do not claim the theory alone proves invariance over thousands of steps. In revision, we will add an explicit discussion paragraph acknowledging this scope limitation and noting that the empirical results serve as the main support for the practical claim of comparable final accuracy. This constitutes a partial revision focused on clarifying the theoretical boundaries rather than extending the analysis.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on empirical validation and single-pass error bounds without self-referential reduction

The abstract and provided context contain no equations, derivations, or self-citations that reduce a claimed result to its own inputs by construction. The theoretical formalization of per-step floating-point error and classification tolerance is presented as an independent analysis, and the central accuracy-matching claim is tied to extensive empirical evaluation rather than any fitted parameter or ansatz smuggled via prior self-work. No load-bearing step matches the enumerated circularity patterns; the skeptic concern about accumulation bounds is a completeness issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; full text unavailable so ledger is minimal. The work invokes a theoretical formalization of floating-point errors.

axioms (1)
  • Domain assumption: Floating-point errors in 16-bit precision can be formalized relative to classification tolerance in neural network training.
    Stated in abstract as providing new insights into conditions for 16-bit approximation of 32-bit results.

pith-pipeline@v0.9.0 · 5768 in / 1065 out tokens · 24433 ms · 2026-05-24T08:24:21.086817+00:00
