FP8 Formats for Deep Learning

Canonical reference. 80% of citing Pith papers cite this work as background.

49 Pith papers citing it
Background 80% of classified citations
abstract

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed-point int8 quantization.
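
The difference between the two encodings described in the abstract can be illustrated with a small decoder. This is a minimal sketch, not the paper's reference implementation: the function name `decode_fp8` is hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) and the IEEE-style subnormal handling are assumptions that the abstract itself does not state.

```python
import math


def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode one FP8 bit pattern (0..255) into a Python float.

    Assumed layout: 1 sign bit, then exponent bits, then mantissa bits.
    Assumed biases: 7 for E4M3 and 15 for E5M2 (not given in the abstract).
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError("fmt must be 'E4M3' or 'E5M2'")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_all_ones
    man = byte & man_all_ones

    # E5M2 follows IEEE 754 conventions: an all-ones exponent encodes
    # infinity (zero mantissa) or NaN (non-zero mantissa).
    if fmt == "E5M2" and exp == exp_all_ones:
        return sign * math.inf if man == 0 else math.nan

    # E4M3 has no infinities and a single NaN mantissa pattern (all ones);
    # the other all-ones-exponent patterns are ordinary finite values.
    if fmt == "E4M3" and exp == exp_all_ones and man == man_all_ones:
        return math.nan

    if exp == 0:
        # Zero and subnormals: no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)

    # Normal values: implicit leading 1.
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    print(decode_fp8(0x7E, "E4M3"))  # largest finite E4M3 value
    print(decode_fp8(0x7B, "E5M2"))  # largest finite E5M2 value
    print(decode_fp8(0x7C, "E5M2"))  # bit pattern reserved for +infinity
```

Under these assumptions, reclaiming the all-ones exponent for finite values is what extends the E4M3 range, while E5M2 spends that exponent entirely on infinities and NaNs.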

citation-role summary

background 9 · other 1

citation-polarity summary

polarities

background 8 · unclear 2

representative citing papers

Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

IEEE P3109 defines a family of adjustable low-precision floating-point formats for ML with decoding to extended reals, multiple rounding modes, block operations, kappa-approximation for approximations, and mechanical verification.

AIS: Adaptive Importance Sampling for Quantized RL

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • Stochastic Rounding Increases Small Singular Values math.NA · 2026-05-29 · unverdicted · none · ref 6 · internal anchor

    Stochastic rounding lifts clusters of small singular values even in constant aspect ratio matrices, extending its role as a spectral regularizer.