arxiv: 2309.17224 · v1 · pith:IGENA62Enew · submitted 2023-09-29 · 💻 cs.LG · cs.AR · cs.CL · cs.ET · cs.PF

Training and inference of large language models using 8-bit floating point

classification 💻 cs.LG cs.AR cs.CL cs.ET cs.PF
keywords formats inference large models training activations gradients language

FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
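
To make the idea of a per-tensor scale concrete, here is a minimal sketch of one common way to pick such a scale: take the tensor's largest absolute value and choose the largest power of two that keeps it inside the format's representable range. This is a generic illustration, not the paper's method. The function name `power_of_two_scale`, the constant `E4M3_MAX = 448.0` (the largest finite value of one widely used E4M3 variant), and the power-of-two restriction are all assumptions made for the example; other FP8 variants have different limits.

```python
import math

# Assumption: largest finite magnitude of one common E4M3 variant.
# Other FP8 variants differ, so the limit is passed as a parameter.
E4M3_MAX = 448.0


def power_of_two_scale(values, fmt_max=E4M3_MAX):
    """Return the largest power-of-two s with max|v| * s <= fmt_max."""
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        return 1.0  # all-zero (or empty) tensor: no scaling needed
    exponent = math.floor(math.log2(fmt_max / amax))
    return 2.0 ** exponent


# Hypothetical flattened tensors observed at two different training steps.
steps = [
    [0.5, -1000.0, 3.0],   # large values: scale down
    [0.001, -0.0005],      # small values: scale up
]
for tensor in steps:
    s = power_of_two_scale(tensor)
    scaled_amax = max(abs(v) for v in tensor) * s
    print(f"scale={s}, scaled amax={scaled_amax}")
```

Recomputing the scale whenever the tensor's value range changes is what makes the scaling dynamic; how often to update it, and whether to use a history of past maxima, are design choices this sketch leaves open.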

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

    cs.LG 2025-10 unverdicted novelty 7.0

    Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

  2. StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.

  3. GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

    cs.LG 2026-05 unverdicted novelty 5.0

    GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.