FP8 Formats for Deep Learning
Pith reviewed 2026-05-15 09:42 UTC · model grok-4.3
The pith
FP8 with E4M3 and E5M2 encodings matches 16-bit training accuracy on large language and image models without hyperparameter changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions.
What carries the argument
The FP8 format consisting of E4M3 and E5M2 encodings that balance range and precision for neural network training and inference.
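To make the two encodings concrete, here is a minimal Python sketch that decodes a single FP8 byte under either encoding. The special-value handling follows the core claim: E5M2 keeps IEEE 754-style infinities and NaNs, while E4M3 has no infinities and a single NaN mantissa pattern. The function name `decode_fp8`, the exponent biases of 7 and 15, and the choice of the all-ones mantissa as the E4M3 NaN pattern are assumptions of this sketch; this page does not state them.

```python
import math


def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode one FP8 byte (0..255) to a Python float.

    Assumed layout: 1 sign bit, then exponent bits, then mantissa bits.
    Assumed biases: 7 for E4M3, 15 for E5M2.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError("fmt must be 'E4M3' or 'E5M2'")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_all_ones
    man = byte & man_all_ones

    if fmt == "E5M2" and exp == exp_all_ones:
        # IEEE 754 convention: top exponent is reserved for inf and NaN.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "E4M3" and exp == exp_all_ones and man == man_all_ones:
        # Assumed single NaN pattern; every other top-exponent code is finite.
        return math.nan

    if exp == 0:
        # Zero and subnormals: no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal numbers: implicit leading 1.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

Under these assumptions, `decode_fp8(0x7E, "E4M3")` evaluates to 448.0, the largest finite E4M3 value, and `decode_fp8(0x7B, "E5M2")` evaluates to 57344.0. Giving up infinities lets E4M3 use the top exponent code for ordinary values, which is the range extension the core claim describes.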
If this is right
- FP8 training matches 16-bit accuracy on CNNs, RNNs, and Transformers without changing hyperparameters.
- The format supports post-training quantization for language models that resist int8 quantization.
- Accuracy is preserved for models up to 175 billion parameters.
- FP8 enables acceleration of both training and inference beyond 16-bit formats.
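What rounding a tensor to FP8 does, whether during training or in post-training quantization, can be illustrated with a small "fake quantization" sketch in pure Python. It rounds each value to the nearest E4M3-representable number and converts back, so the error introduced by the format can be inspected directly. Several choices here are assumptions of the sketch and are not described on this page: the per-tensor scale that maps the largest magnitude to the E4M3 maximum, saturation of out-of-range values instead of producing NaN, round-half-to-even, and the helper names `round_to_e4m3` and `fake_quantize`.

```python
import math

E4M3_MAX = 448.0          # largest finite E4M3 value (assumes bias 7)
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal E4M3 value
E4M3_MAN_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value, saturating at +/-448 (assumed)."""
    if math.isnan(x):
        return math.nan
    sign = math.copysign(1.0, x)
    mag = min(abs(x), E4M3_MAX)  # saturate instead of overflowing
    if mag == 0.0:
        return sign * 0.0
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    exponent = max(e - 1, E4M3_MIN_NORMAL_EXP)  # clamp into the subnormal range
    step = 2.0 ** (exponent - E4M3_MAN_BITS)    # spacing of values in this binade
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * round(mag / step) * step


def fake_quantize(values):
    """Scale values so the largest magnitude maps to 448, round, and rescale."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    dequantized = [round_to_e4m3(v * scale) / scale for v in values]
    return dequantized, scale
```

For example, `round_to_e4m3(1.06)` returns 1.0 because the spacing between representable values in [1, 2) is 0.125. Comparing `fake_quantize(weights)[0]` against the original `weights` gives a quick estimate of the quantization error for a tensor, without requiring FP8 hardware.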
Where Pith is reading between the lines
- Adopting FP8 could halve the memory and compute costs for large AI model training compared to 16-bit formats.
- The format's design may serve as a template for even lower precision formats in future deep learning hardware.
- Integration into standard processors would allow seamless switching from 16-bit to FP8 in existing workflows.
- Further validation on additional tasks like reinforcement learning could extend the applicability.
Load-bearing premise
That the chosen E4M3 and E5M2 encodings will preserve accuracy across all tasks and model scales without any hyperparameter retuning or task-specific adjustments.
What would settle it
Observing a significant accuracy drop when training a 175B parameter language model in FP8 compared to 16-bit, with all other training settings unchanged, would falsify the main claim.
Original abstract
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed point int8 quantization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two 8-bit floating-point interchange formats, E4M3 (4-bit exponent, 3-bit mantissa, no infinities, single NaN pattern) and E5M2 (5-bit exponent, 2-bit mantissa, IEEE 754 conventions), for deep-learning training and inference. It reports that these formats match 16-bit accuracy on CNNs, RNNs, and Transformer-based models (including language models up to 175B parameters) when all optimizer, learning-rate, batch-size, and loss-scaling hyperparameters are left unchanged from the 16-bit baselines; it also examines FP8 post-training quantization on models resistant to int8.
Significance. If the reported matching accuracy holds across the claimed scales and architectures, the work provides a concrete, immediately usable path to accelerate both training and inference on hardware that supports FP8, with direct relevance to scaling large models while preserving quality and without requiring hyperparameter retuning.
minor comments (3)
- [Abstract] Abstract: the efficacy claim would be stronger if it included one or two concrete accuracy numbers (e.g., top-1 on ImageNet or perplexity on a language-modeling benchmark) rather than the general statement of 'matching the result quality.'
- [Section 3] Section 3 (format definitions): a small table comparing the dynamic range and precision of E4M3/E5M2 against FP16 and bfloat16 would help readers quickly assess why the chosen encodings are expected to suffice for the reported tasks.
- [Experiments] Experimental sections: while the paper states that hyperparameters were left unchanged, it would be useful to list the exact loss-scaling factors used for each model family to allow exact reproduction.
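The range and precision comparison requested in the Section 3 comment can be approximated from the bit allocations alone. The sketch below derives the largest finite value, the smallest normal and subnormal values, and the relative spacing for E4M3, E5M2, FP16, and bfloat16. It is an illustration, not the paper's table: the `FloatFormat` class is hypothetical, the bias is assumed to be 2^(exponent bits - 1) - 1 for every format, and E4M3 is assumed to reserve only the all-ones exponent with all-ones mantissa for NaN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exp_bits: int
    man_bits: int
    # True: top exponent code is reserved for inf/NaN (IEEE 754 style).
    # False: only the single all-ones pattern is reserved (assumed for E4M3).
    ieee_specials: bool = True

    @property
    def bias(self) -> int:
        return (1 << (self.exp_bits - 1)) - 1

    @property
    def max_normal(self) -> float:
        top_exp = (1 << self.exp_bits) - 1
        top_man = (1 << self.man_bits) - 1
        if self.ieee_specials:
            exp, man = top_exp - 1, top_man
        else:
            exp, man = top_exp, top_man - 1
        return (1 + man / (1 << self.man_bits)) * 2.0 ** (exp - self.bias)

    @property
    def min_normal(self) -> float:
        return 2.0 ** (1 - self.bias)

    @property
    def min_subnormal(self) -> float:
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def epsilon(self) -> float:
        # Gap between 1.0 and the next representable value.
        return 2.0 ** (-self.man_bits)


E4M3 = FloatFormat("E4M3", 4, 3, ieee_specials=False)
E5M2 = FloatFormat("E5M2", 5, 2)
FP16 = FloatFormat("FP16", 5, 10)
BF16 = FloatFormat("bfloat16", 8, 7)
FORMATS = (E4M3, E5M2, FP16, BF16)

if __name__ == "__main__":
    print(f"{'format':<9}{'max':>12}{'min normal':>14}{'min subnormal':>16}{'epsilon':>12}")
    for f in FORMATS:
        print(f"{f.name:<9}{f.max_normal:>12.4g}{f.min_normal:>14.4g}"
              f"{f.min_subnormal:>16.4g}{f.epsilon:>12.4g}")
```

Under these assumptions E5M2 has the same exponent width as FP16 and therefore a similar range (maximum 57344 versus 65504), while E4M3 trades range (maximum 448) for one extra mantissa bit.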
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the recognition of the practical relevance of the proposed FP8 formats for both training and inference at scale.
Circularity Check
No significant circularity
The paper defines explicit FP8 encodings (E4M3 and E5M2) with fixed bit allocations and special-value rules, then reports direct empirical accuracy matches to 16-bit baselines on CNNs, RNNs, and Transformers (including 175B models) using identical hyperparameters. No mathematical derivations, fitted parameters, or predictions appear; all load-bearing claims are observational results from controlled experiments rather than quantities that reduce to the inputs by construction. No self-citations are invoked to justify uniqueness or force the format choice. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard rules for floating-point representation and rounding apply, except for special values in E4M3.
invented entities (1)
- E4M3 encoding without infinities: no independent evidence
Forward citations
Cited by 21 Pith papers
- AIS: Adaptive Importance Sampling for Quantized RL
  AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
- The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
  Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
- HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
  Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
- TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
  TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.
- ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
  ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
- Search Your Block Floating Point Scales!
  ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
- ShardTensor: Domain Parallelism for Scientific Machine Learning
  ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
  FalconGEMM is a framework with deployment, execution, and decision modules that makes lower-complexity matrix multiplication practical, outperforming standard GEMM libraries by 7.59-17.85% and competitors like AlphaTe...
- FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
  FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.8...
- Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
  Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
- ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
  ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier i...
- Neural-Network-Based Variational Method in Nuclear Density Functional Theory: Application to the Extended Thomas-Fermi Model
  Neural networks parametrize nuclear densities and are variationally optimized to solve the extended Thomas-Fermi model, reproducing binding energies within 0.5% and pasta structures.
- Neural-Network-Based Variational Method in Nuclear Density Functional Theory: Application to the Extended Thomas-Fermi Model
  Neural networks represent densities in a variational extended Thomas-Fermi model, yielding binding energies within 0.5% of prior ETF results and reproducing nuclear pasta phases.
- StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
  StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
- LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
  LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
- STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
  STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
- AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
  AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
  Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
  TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
- HiFloat4 Format for Language Model Pre-training on Ascend NPUs
  HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.
- Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S
  Lightning V2 achieves 4x lower on-prem accelerator cost for TTS inference on Tenstorrent hardware than NVIDIA L40S at equivalent throughput and production audio fidelity.