Training and inference of large language models using 8-bit floating point

Andrew William Fitzgibbon; Carlo Luschi; Charlie Blake; James Briggs; Josh Levy-Kramer; Paul Balanca; Sergio P. Perez; Stephen Barlow; Yan Zhang

arxiv: 2309.17224 · v1 · pith:IGENA62Enew · submitted 2023-09-29 · 💻 cs.LG · cs.AR · cs.CL · cs.ET · cs.PF

classification 💻 cs.LG cs.AR cs.CL cs.ET cs.PF

keywords formats inference large models training activations gradients language

FP8 formats are gaining popularity as a way to boost the computational efficiency of training and inference for large deep learning models. Their main challenge is that scalings must be chosen carefully to prevent degradation caused by the reduced dynamic range relative to higher-precision formats. Although there is ample literature on selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology for selecting the scalings of FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate GPT- and Llama 2-style large language models using FP8, for model sizes ranging from 111M to 70B parameters. To aid understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
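
The idea of per-tensor scaling for an FP8 linear layer can be illustrated with a small NumPy simulation. This is only a sketch under stated assumptions, not the paper's algorithm: it assumes an E4M3 variant whose largest finite value is 448 (other FP8 variants differ), restricts scales to powers of two, recomputes each scale from the tensor's current absolute maximum, and emulates FP8 rounding in float64 while ignoring subnormals and underflow. The names `compute_scale_exponent`, `fake_quantize_fp8` and `fp8_linear` are hypothetical.

```python
import numpy as np

E4M3_MAX = 448.0   # assumption: largest finite value of the E4M3 variant
MANTISSA_BITS = 3  # explicit mantissa bits in E4M3


def compute_scale_exponent(x, fmt_max=E4M3_MAX):
    """Largest integer e such that max|x| * 2**e still fits below fmt_max."""
    amax = float(np.max(np.abs(x)))
    if amax == 0.0:
        return 0  # all-zero tensor: any scale works, pick the neutral one
    return int(np.floor(np.log2(fmt_max / amax)))


def fake_quantize_fp8(x, fmt_max=E4M3_MAX, mantissa_bits=MANTISSA_BITS):
    """Emulate FP8 rounding: clip to range, keep mantissa_bits + 1 significant bits.

    Subnormals and underflow are ignored, so this is optimistic for tiny values.
    """
    x = np.clip(x, -fmt_max, fmt_max)
    m, e = np.frexp(x)                    # x == m * 2**e with 0.5 <= |m| < 1
    step = 2.0 ** (mantissa_bits + 1)
    m = np.round(m * step) / step         # round the significand
    return np.ldexp(m, e)


def fp8_linear(x, w):
    """Simulated FP8 matmul with one dynamically computed scale per tensor."""
    ex = compute_scale_exponent(x)        # activation scale, recomputed per call
    ew = compute_scale_exponent(w)        # weight scale, recomputed per call
    xq = fake_quantize_fp8(x * 2.0 ** ex)
    wq = fake_quantize_fp8(w * 2.0 ** ew)
    y = xq @ wq                           # accumulation stays in higher precision
    return y * 2.0 ** (-(ex + ew))        # undo both scales on the output


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)) * 1e-3   # small-magnitude activations
w = rng.standard_normal((64, 8)) * 50.0   # large-magnitude weights
y_ref = x @ w
y_fp8 = fp8_linear(x, w)
rel_err = np.linalg.norm(y_fp8 - y_ref) / np.linalg.norm(y_ref)
print(f"relative error of simulated FP8 matmul: {rel_err:.4f}")
```

Because the scales are powers of two, applying and removing them changes only the exponent, so all of the error in this sketch comes from the emulated FP8 rounding. The same scheme would apply to gradients in the backward pass, each with its own scale.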

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
cs.LG 2025-10 unverdicted novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
cs.LG 2026-05 unverdicted novelty 5.0

GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.