FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite · 2022 · cs.LG · arXiv 2209.05433

Canonical reference. 80% of citing Pith papers cite this work as background.

49 Pith papers citing it

abstract

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed point int8 quantization.
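
The two encodings described in the abstract can be illustrated with a small decoder. This is a minimal sketch rather than the authors' reference code: the function name decode_fp8 is hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) are assumptions, since the abstract does not state them. It shows how E5M2 reserves the all-ones exponent for infinities and NaNs in the IEEE 754 manner, while E4M3 reserves only the single all-ones exponent and mantissa pattern for NaN and uses the remaining all-ones-exponent codes for ordinary values.

```python
import math


def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one FP8 bit pattern (0..255) into a Python float.

    Illustrative helper, not part of any library. The biases below are
    assumptions (7 for E4M3, 15 for E5M2); the abstract does not give them.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    if fmt == "e4m3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "e5m2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError("fmt must be 'e4m3' or 'e5m2'")

    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1

    # Layout is 1 sign bit, then the exponent field, then the mantissa field.
    sign = -1.0 if (byte >> 7) else 1.0
    exp = (byte >> man_bits) & exp_all_ones
    man = byte & man_all_ones

    # E5M2: IEEE 754 style, all-ones exponent means infinity or NaN.
    if fmt == "e5m2" and exp == exp_all_ones:
        return sign * math.inf if man == 0 else math.nan

    # E4M3: no infinities, a single mantissa pattern (all ones) encodes NaN.
    if fmt == "e4m3" and exp == exp_all_ones and man == man_all_ones:
        return math.nan

    # Subnormal numbers: zero exponent field, no implicit leading one.
    if exp == 0:
        return sign * man * 2.0 ** (1 - bias - man_bits)

    # Normal numbers: implicit leading one.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    print(decode_fp8(0x7E, "e4m3"))  # largest finite E4M3 value
    print(decode_fp8(0x7B, "e5m2"))  # largest finite E5M2 value
    print(decode_fp8(0x7C, "e5m2"))  # E5M2 positive infinity
```

Under these assumptions the largest finite E4M3 value decodes to 448, whereas an IEEE-style treatment of the all-ones exponent would cap it at 240; this is the extended dynamic range the abstract refers to.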

citation-role summary

background: 9 · other: 1

citation-polarity summary

background: 8 · unclear: 2

representative citing papers

Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

cs.AR · 2025-11-14 · accept · novelty 8.0

The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.

Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

IEEE P3109 defines a family of adjustable low-precision floating-point formats for ML with decoding to extended reals, multiple rounding modes, block operations, kappa-approximation for approximations, and mechanical verification.

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Floating-point neural networks achieve universal representability for practical activations like ReLU, sigmoid, and tanh under arbitrary reduction orders and bounded ulp errors in activations via a new distinguishability condition.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

AIS: Adaptive Importance Sampling for Quantized RL

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines

cs.AR · 2026-05-08 · unverdicted · novelty 7.0

TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

dMX is a differentiable mixed-precision framework that learns per-layer MXFP bit-width assignments for LLMs and outperforms KL-based heuristics on perplexity and zero-shot accuracy under bit-width budgets.

Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

RBDC trains wide vision models by recursive block-diagonal coupling of narrower pre-trained models, reducing training FLOPs by 30% at similar ImageNet accuracy for DeiT and ResNet while outperforming model growth baselines.

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

cs.LG · 2026-05-19 · unverdicted · novelty 6.0 · 3 refs

MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.

Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

ShardTensor: Domain Parallelism for Scientific Machine Learning

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.

Neural-Network-Based Variational Method in Nuclear Density Functional Theory: Application to the Extended Thomas-Fermi Model

nucl-th · 2026-04-28 · unverdicted · novelty 6.0 · 2 refs

Neural networks represent densities in a variational extended Thomas-Fermi model, yielding binding energies within 0.5% of prior ETF results and reproducing nuclear pasta phases.

StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.

citing papers explorer

Showing 18 of 18 citing papers after filters.

Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning cs.LG · 2026-06-01 · unverdicted · partial · ref 3 · internal anchor
IEEE P3109 defines a family of adjustable low-precision floating-point formats for ML with decoding to extended reals, multiple rounding modes, block operations, kappa-approximation for approximations, and mechanical verification.
Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations cs.LG · 2026-05-27 · unverdicted · none · ref 18 · internal anchor
Floating-point neural networks achieve universal representability for practical activations like ReLU, sigmoid, and tanh under arbitrary reduction orders and bounded ulp errors in activations via a new distinguishability condition.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention cs.LG · 2025-10-05 · unverdicted · none · ref 16 · internal anchor
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling cs.LG · 2025-09-03 · unverdicted · none · ref 32 · internal anchor
DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision cs.LG · 2024-07-11 · accept · none · ref 37 · internal anchor
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats cs.LG · 2026-06-02 · unverdicted · none · ref 20 · internal anchor
dMX is a differentiable mixed-precision framework that learns per-layer MXFP bit-width assignments for LLMs and outperforms KL-based heuristics on perplexity and zero-shot accuracy under bit-width budgets.
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor cs.LG · 2026-05-19 · unverdicted · none · ref 28 · 3 links · internal anchor
MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 144 · internal anchor
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models cs.LG · 2026-04-16 · unverdicted · none · ref 26 · internal anchor
StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training cs.LG · 2026-04-08 · unverdicted · none · ref 18 · internal anchor
STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation cs.LG · 2026-04-02 · unverdicted · none · ref 27 · internal anchor
AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models cs.LG · 2025-12-13 · unverdicted · none · ref 16 · internal anchor
BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
GNMR: Runtime Stability Control for Low-Precision Large Language Model Training cs.LG · 2026-05-30 · unverdicted · none · ref 38 · internal anchor
GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.
Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment cs.LG · 2026-05-26 · unverdicted · none · ref 15 · internal anchor
Introduces kernel contracts framework with derived bounds on divergence from logit drift to reward drift, specialized for RL post-training under support and norm assumptions.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 21 · internal anchor
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining cs.LG · 2026-02-11 · conditional · none · ref 30 · internal anchor
SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.
Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training cs.LG · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
Identifies amax saturation and catastrophic forgetting in HiF8 W8A8 QAT for OpenPangu-Embedded-1B and mitigates them with 64-step max-window DTS and 500-step BF16 warmup at lr=1e-5 to achieve under 0.6% benchmark drops.
HiFloat4 Format for Language Model Pre-training on Ascend NPUs cs.LG · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.
