arxiv: 2209.05433 · v2 · submitted 2022-09-12 · 💻 cs.LG

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu

Pith reviewed 2026-05-15 09:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords FP8 · 8-bit floating point · deep learning training · quantization · large language models · CNN · RNN · Transformer

The pith

FP8 with E4M3 and E5M2 encodings matches 16-bit training accuracy on large language and image models without hyperparameter changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an 8-bit floating point format for deep learning to speed up training and inference beyond current 16-bit standards. It defines two encodings, E4M3 and E5M2, with E4M3 extending the dynamic range by omitting infinities. Experiments show this format achieves the same result quality as 16-bit training across CNNs, RNNs, and Transformers, including models with up to 175 billion parameters, while keeping all other settings the same. It also works for post-training quantization of models that resist integer quantization.

Core claim

We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions.

What carries the argument

The FP8 format consisting of E4M3 and E5M2 encodings that balance range and precision for neural network training and inference.
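
To make the two encodings concrete, the sketch below decodes a single FP8 bit pattern into a Python float. It assumes exponent biases of 7 (E4M3) and 15 (E5M2) and IEEE-style subnormals, details this page does not state. `decode_fp8` is a hypothetical helper name, not an API from the paper.

```python
import math

def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode one FP8 bit pattern (0..255) into a Python float.

    Assumed layout: 1 sign bit, then exponent bits, then mantissa bits.
    Assumed biases: 7 for E4M3, 15 for E5M2.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format: {fmt}")

    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones_exp = (1 << exp_bits) - 1
    all_ones_man = (1 << man_bits) - 1

    if exp == all_ones_exp:
        if fmt == "E5M2":
            # IEEE 754 convention: zero mantissa is infinity, otherwise NaN.
            return sign * math.inf if man == 0 else math.nan
        if man == all_ones_man:
            # E4M3: the single NaN mantissa pattern.
            return math.nan
        # E4M3: every other pattern in the top exponent is an ordinary number.
    if exp == 0:
        # Zero and subnormals: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

print(decode_fp8(0x7E, "E4M3"))  # largest finite E4M3 value
print(decode_fp8(0x7B, "E5M2"))  # largest finite E5M2 value
```

Under these assumptions the largest finite E4M3 value is 448 and the largest finite E5M2 value is 57344. The top E4M3 exponent code contributes seven finite magnitudes per sign that an IEEE-style encoding would spend on infinity and NaNs.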

If this is right

FP8 training matches 16-bit accuracy on CNNs, RNNs, and Transformers without changing hyperparameters.
The format supports post-training quantization for language models that resist int8 quantization.
Accuracy is preserved for models up to 175 billion parameters.
FP8 enables acceleration of both training and inference beyond 16-bit formats.
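
The post-training quantization item above can be illustrated with a simulated ("fake") quantization step: scale a tensor so its largest magnitude maps to the top of the E4M3 range, round each value to the nearest representable E4M3 number, then scale back. Per-tensor max scaling, saturation at ±448, and round-to-nearest-even are assumptions of this sketch, as are the helper names. The page does not describe the paper's actual calibration procedure.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 magnitude (assumes bias 7)
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal value
E4M3_MAN_BITS = 3

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (ties to even).

    Out-of-range magnitudes saturate to +/-448 (an assumption of this sketch).
    """
    if math.isnan(x):
        return math.nan
    if x == 0.0:
        return x
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    binade = max(e - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (binade - E4M3_MAN_BITS)  # spacing of values in this binade
    # Python's round() uses round-half-to-even on the integer grid.
    q = round(mag / step) * step
    return math.copysign(min(q, E4M3_MAX), x)

def fake_quantize(values):
    """Per-tensor scaled quantize/dequantize round trip through E4M3."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) / scale for v in values]

weights = [0.013, -0.4, 0.27, 1.9]
print(fake_quantize(weights))
```

Comparing a model's outputs before and after `fake_quantize` is one simple way to estimate quantization error without FP8 hardware. Real deployments may use per-channel scales or other calibration choices.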

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adopting FP8 could halve the memory and compute costs for large AI model training compared to 16-bit formats.
The format's design may serve as a template for even lower precision formats in future deep learning hardware.
Integration into standard processors would allow seamless switching from 16-bit to FP8 in existing workflows.
Further validation on additional tasks like reinforcement learning could extend the applicability.
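
The "halve the memory" extension above is simple arithmetic for weight storage, sketched here for the 175B-parameter scale mentioned in the paper. The calculation counts weights only. Optimizer state, any higher-precision weight copies, and activations are deliberately left out, so the whole-training footprint would not necessarily shrink by the same factor. `weight_bytes` is a hypothetical helper.

```python
def weight_bytes(num_params: int, bits_per_value: int) -> int:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_value // 8

params = 175_000_000_000  # 175B parameters

fp16_gb = weight_bytes(params, 16) / 1e9
fp8_gb = weight_bytes(params, 8) / 1e9

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"FP8 weights:    {fp8_gb:.0f} GB")
print(f"ratio:          {fp8_gb / fp16_gb:.2f}")
```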

Load-bearing premise

That the chosen E4M3 and E5M2 encodings will preserve accuracy across all tasks and model scales without any hyperparameter retuning or task-specific adjustments.

What would settle it

Observing a significant accuracy drop when training a 175B parameter language model in FP8 compared to 16-bit, with all other training settings unchanged, would falsify the main claim.

read the original abstract

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed point int8 quantization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FP8 with E4M3 and E5M2 matches FP16 accuracy on CNNs, RNNs, and Transformers up to 175B parameters when hyperparameters stay fixed.

read the letter

The main point is that these two FP8 encodings deliver training results indistinguishable from FP16 across the tested models while halving the bit width. The paper shows this holds for image classification, language modeling, and other tasks, including models with 175 billion parameters, and the setup keeps the optimizer, learning rate, batch size, and loss scaling identical to the 16-bit runs. That direct comparison is the cleanest part of the work. The encodings themselves are the practical novelty: E4M3 drops infinities to stretch the range for weights and activations, and E5M2 follows IEEE 754 conventions and is used for gradients. The authors also test post-training quantization on models that resisted int8, which adds a useful data point for inference. The experiments cover the main modern architectures without special tuning, so the results speak directly to whether the reduced dynamic range is enough in practice. One minor limitation is that the paper focuses on the cases where it works; readers will still want to see failure modes or tasks where the formats need extra handling. Overall the evidence looks reproducible from the described protocol, and the scale of the language-model runs gives it weight. This is the kind of paper hardware teams and large-model trainers will actually use. It belongs in peer review so the community can check the numbers and extend the formats.

Referee Report

0 major / 3 minor

Summary. The paper proposes two 8-bit floating-point interchange formats, E4M3 (4-bit exponent, 3-bit mantissa, no infinities, single NaN pattern) and E5M2 (5-bit exponent, 2-bit mantissa, IEEE 754 conventions), for deep-learning training and inference. It reports that these formats match FP16 accuracy on CNNs, RNNs, and Transformer-based models (including language models up to 175B parameters) when all optimizer, learning-rate, batch-size, and loss-scaling hyperparameters are left unchanged from the 16-bit baselines; it also examines FP8 post-training quantization on models resistant to int8.

Significance. If the reported matching accuracy holds across the claimed scales and architectures, the work provides a concrete, immediately usable path to accelerate both training and inference on hardware that supports FP8, with direct relevance to scaling large models while preserving quality and without requiring hyperparameter retuning.

minor comments (3)

[Abstract] The efficacy claim would be stronger if it included one or two concrete accuracy numbers (e.g., top-1 on ImageNet or perplexity on a language-modeling benchmark) rather than the general statement of 'matching the result quality.'
[Section 3] Format definitions: a small table comparing the dynamic range and precision of E4M3/E5M2 against FP16 and bfloat16 would help readers quickly assess why the chosen encodings are expected to suffice for the reported tasks.
[Experiments] While the paper states that hyperparameters were left unchanged, it would be useful to list the exact loss-scaling factors used for each model family to allow exact reproduction.
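
The range and precision comparison suggested in the Section 3 comment can be approximated from the bit allocations alone. The sketch below is not taken from the paper. It assumes the usual bias of 2^(exponent bits - 1) - 1, IEEE-style subnormals, and, for E4M3, that only the all-ones mantissa in the top exponent is reserved (for NaN). `format_stats` is a hypothetical helper.

```python
def format_stats(exp_bits: int, man_bits: int, ieee_specials: bool = True) -> dict:
    """Range and precision figures for a binary floating-point layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        # Top exponent code is reserved for infinities and NaNs.
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2.0 ** -man_bits
    else:
        # E4M3-style: top exponent is usable, all-ones mantissa is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2.0 ** (1 - man_bits)
    return {
        "max": max_frac * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
        "epsilon": 2.0 ** -man_bits,  # gap between 1.0 and the next value
    }

formats = {
    "E4M3": (4, 3, False),
    "E5M2": (5, 2, True),
    "FP16": (5, 10, True),
    "bfloat16": (8, 7, True),
}

print(f"{'format':<9}{'max':>12}{'min normal':>14}{'min subnormal':>16}{'epsilon':>12}")
for name, args in formats.items():
    s = format_stats(*args)
    print(f"{name:<9}{s['max']:>12.4g}{s['min_normal']:>14.4g}"
          f"{s['min_subnormal']:>16.4g}{s['epsilon']:>12.4g}")
```

The printed rows make the trade-off visible: E5M2 has nearly the range of FP16 with far coarser spacing, while E4M3 gives up range for one extra mantissa bit.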

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the recognition of the practical relevance of the proposed FP8 formats for both training and inference at scale.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines explicit FP8 encodings (E4M3 and E5M2) with fixed bit allocations and special-value rules, then reports direct empirical accuracy matches to 16-bit baselines on CNNs, RNNs, and Transformers (including 175B models) using identical hyperparameters. No mathematical derivations, fitted parameters, or predictions appear; all load-bearing claims are observational results from controlled experiments rather than quantities that reduce to the inputs by construction. No self-citations are invoked to justify uniqueness or force the format choice. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The formats are defined by fixed bit allocations with one deviation from IEEE 754; no numerical fitting or new physical entities are introduced.

axioms (1)

standard math Standard rules for floating-point representation and rounding apply except for special values in E4M3
Invoked when defining E4M3 without infinities and single NaN pattern
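
The effect of this special-value rule can be counted directly by classifying all 256 bit patterns in each encoding. This is a sketch under the layout stated in the abstract (sign bit, then exponent, then mantissa); `classify` and `census` are hypothetical names.

```python
from collections import Counter

def classify(byte: int, fmt: str) -> str:
    """Label an 8-bit pattern as 'finite', 'inf', or 'nan'."""
    exp_bits, man_bits = {"E4M3": (4, 3), "E5M2": (5, 2)}[fmt]
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp != (1 << exp_bits) - 1:
        return "finite"
    if fmt == "E5M2":
        # IEEE 754 convention for the all-ones exponent.
        return "inf" if man == 0 else "nan"
    # E4M3: only the all-ones mantissa is NaN; there are no infinities.
    return "nan" if man == (1 << man_bits) - 1 else "finite"

def census(fmt: str) -> Counter:
    """Count how many of the 256 patterns fall in each class."""
    return Counter(classify(b, fmt) for b in range(256))

for fmt in ("E4M3", "E5M2"):
    print(fmt, dict(census(fmt)))
```

E4M3 spends 2 of its 256 patterns on NaN and none on infinity, leaving 254 finite encodings (including both signed zeros). E5M2 reserves 8 patterns for infinities and NaNs, leaving 248.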

invented entities (1)

E4M3 encoding without infinities no independent evidence
purpose: Extend dynamic range for weights and activations in deep learning
Introduced to better cover the numerical range encountered in neural network training

pith-pipeline@v0.9.0 · 5551 in / 1249 out tokens · 34536 ms · 2026-05-15T09:42:41.400605+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AIS: Adaptive Importance Sampling for Quantized RL
stat.ML 2026-05 unverdicted novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
cs.DC 2026-05 unverdicted novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
cs.CL 2026-05 unverdicted novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
cs.AR 2026-05 unverdicted novelty 7.0

TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
ShardTensor: Domain Parallelism for Scientific Machine Learning
cs.DC 2026-05 unverdicted novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
cs.DC 2026-05 unverdicted novelty 6.0

FalconGEMM is a framework with deployment, execution, and decision modules that makes lower-complexity matrix multiplication practical, outperforming standard GEMM libraries by 7.59-17.85% and competitors like AlphaTe...
FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
cs.DC 2026-05 unverdicted novelty 6.0

FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.8...
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
stat.ML 2026-05 unverdicted novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
cs.CV 2026-05 unverdicted novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier i...
Neural-Network-Based Variational Method in Nuclear Density Functional Theory: Application to the Extended Thomas-Fermi Model
nucl-th 2026-04 unverdicted novelty 6.0

Neural networks parametrize nuclear densities and are variationally optimized to solve the extended Thomas-Fermi model, reproducing binding energies within 0.5% and pasta structures.
Neural-Network-Based Variational Method in Nuclear Density Functional Theory: Application to the Extended Thomas-Fermi Model
nucl-th 2026-04 unverdicted novelty 6.0

Neural networks represent densities in a variational extended Thomas-Fermi model, yielding binding energies within 0.5% of prior ETF results and reproducing nuclear pasta phases.
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
cs.AR 2026-04 unverdicted novelty 6.0

LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
cs.LG 2026-04 unverdicted novelty 6.0

STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
cs.LG 2026-04 unverdicted novelty 6.0

AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG 2026-05 accept novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
cs.DC 2026-04 unverdicted novelty 5.0

TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
HiFloat4 Format for Language Model Pre-training on Ascend NPUs
cs.LG 2026-04 unverdicted novelty 4.0

HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.
Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S
eess.AS 2026-03 unverdicted novelty 3.0

Lightning V2 achieves 4x lower on-prem accelerator cost for TTS inference on Tenstorrent hardware than NVIDIA L40S at equivalent throughput and production audio fidelity.
