Optimizing Large Language Model Training Using FP4 Quantization

Baining Guo; Guoshuai Zhao; Peng Cheng; Ruizhe Wang; Xiao Liu; Yeyun Gong; Zhengjun Zha; Ziyue Yang

arxiv: 2501.17116 · v2 · pith:6BBITQMBnew · submitted 2025-01-28 · 💻 cs.LG · cs.CL

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

Pith reviewed 2026-05-23 04:29 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords FP4 quantization · low-precision training · large language models · differentiable quantization · outlier clamping · mixed-precision training · vector-wise quantization · training stability

The pith

FP4 quantization trains large language models to accuracy levels comparable to BF16 and FP8.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents what it describes as the first FP4 training framework for LLMs. It tackles high quantization error with a differentiable estimator that allows accurate weight updates and an outlier clamping and compensation method that stops activations from collapsing, all within a mixed-precision scheme rather than a purely FP4 pipeline. Experiments show the approach works on models up to 13 billion parameters trained on up to 100 billion tokens while keeping accuracy close to higher-precision baselines. If the method holds, it would cut the memory and compute cost of training on the FP4-capable hardware that is now emerging.

Core claim

The central claim is that a combination of a differentiable quantization estimator, outlier clamping with compensation, mixed-precision training, and vector-wise quantization makes FP4 stable enough for LLM training, delivering accuracy comparable to BF16 and FP8 with only minimal degradation even at 13B scale and 100B tokens.
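
To make the vector-wise quantization component concrete, here is a minimal sketch in plain Python. It assumes the E2M1 FP4 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and absmax scaling with one scale factor per row. The function names, the row-wise granularity, and the grid choice are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: per-vector (row-wise) absmax scaling onto an FP4 grid.
# Assumption: E2M1 FP4 magnitudes; the paper's exact format and granularity may differ.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * m for m in FP4_MAGNITUDES for s in (-1.0, 1.0)})
FP4_MAX = 6.0


def round_to_grid(x):
    """Return the FP4 grid value nearest to x."""
    return min(FP4_GRID, key=lambda g: abs(g - x))


def quantize_vector(vec):
    """Scale one vector so its absmax maps to FP4_MAX, round, then rescale."""
    absmax = max(abs(v) for v in vec)
    if absmax == 0.0:
        return list(vec)
    scale = FP4_MAX / absmax
    return [round_to_grid(v * scale) / scale for v in vec]


def quantize_rows(matrix):
    """Vector-wise quantization: every row gets its own scale factor."""
    return [quantize_vector(row) for row in matrix]


def quantize_tensor(matrix):
    """Tensor-wise baseline: one scale factor shared by all rows."""
    absmax = max(abs(v) for row in matrix for v in row)
    if absmax == 0.0:
        return [list(row) for row in matrix]
    scale = FP4_MAX / absmax
    return [[round_to_grid(v * scale) / scale for v in row] for row in matrix]


def max_abs_error(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

With a matrix such as `[[0.01, -0.02, 0.03], [10.0, -20.0, 60.0]]`, the shared tensor-wise scale is set by the large row, so every entry of the small row rounds to zero. The row-wise variant gives each row its own scale and keeps both rows representable, which is the intuition behind finer-grained scaling.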

What carries the argument

The differentiable quantization estimator, which replaces non-differentiable rounding with a smooth approximation so that gradients can flow back to the original weights during training.
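
The idea of pairing hard rounding with a smooth stand-in for its gradient can be sketched as follows. This is a hedged illustration, not the paper's implementation: the E2M1 grid, the power-function surrogate, the sharpness parameter `k`, the gradient cap `clip`, and the zero gradient outside the representable range are all assumptions chosen for the example.

```python
# Illustrative sketch only: hard FP4 rounding in the forward pass, with the
# derivative of a smooth power-function surrogate used in the backward pass.
# Assumptions: E2M1 grid, the surrogate shape, and the parameters k and clip.
import bisect

FP4_GRID = [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
            0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize(x):
    """Forward pass: clamp to the grid range, then snap to the nearest grid value."""
    x = max(FP4_GRID[0], min(FP4_GRID[-1], x))
    return min(FP4_GRID, key=lambda g: abs(g - x))


def surrogate_grad(x, k=5.0, clip=3.0):
    """Backward pass: slope of a smooth stand-in for the rounding staircase.

    Inside the interval [lo, hi] between adjacent grid values, with
    t = (x - lo) / (hi - lo), the surrogate is
        f(x) = lo + (hi - lo) / 2 * (1 + sign(2t - 1) * |2t - 1| ** (1 / k))
    whose derivative is (1 / k) * |2t - 1| ** (1 / k - 1).
    k = 1 gives f(x) = x, i.e. the straight-through estimator.
    """
    if x <= FP4_GRID[0] or x >= FP4_GRID[-1]:
        return 0.0  # assumption: no gradient once the value saturates
    i = bisect.bisect_right(FP4_GRID, x) - 1
    lo, hi = FP4_GRID[i], FP4_GRID[i + 1]
    t = (x - lo) / (hi - lo)
    u = max(abs(2.0 * t - 1.0), 1e-12)  # avoid 0 ** negative at the midpoint
    return min((1.0 / k) * u ** (1.0 / k - 1.0), clip)
```

The surrogate's slope is small near grid values and large near the midpoint between them, where rounding flips from one value to the next. Setting `k=1.0` reduces it to the plain straight-through estimator, which passes the gradient through unchanged.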

If this is right

FP4 training becomes practical for models at least as large as 13B parameters without accuracy collapse.
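Tensors held in FP4 take one-quarter of the storage of BF16 (4 bits versus 16 bits per value), and arithmetic on them gets correspondingly cheaper; end-to-end savings will be smaller than that because the mixed-precision scheme keeps some components in higher precision.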
The same mixed-precision and compensation techniques can be reused as next-generation hardware adds native FP4 support.
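
The storage side of these implications is simple arithmetic. The sketch below covers raw weight storage only, at the 13B scale discussed above; it assumes every weight is stored at the stated bit width and deliberately ignores scale factors, optimizer state, activations, and any higher-precision copies a mixed-precision scheme would keep.

```python
# Back-of-envelope arithmetic only: raw weight storage at different bit widths.
# Assumptions: every weight is stored at the stated width; scale factors,
# optimizer state, activations, and higher-precision copies are ignored.
def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store num_params values at bits_per_param bits each."""
    return num_params * bits_per_param / 8


params = 13_000_000_000  # the 13B scale mentioned above
bf16 = weight_bytes(params, 16)
fp8 = weight_bytes(params, 8)
fp4 = weight_bytes(params, 4)

print(f"BF16: {bf16 / 1e9:.1f} GB")
print(f"FP8:  {fp8 / 1e9:.1f} GB")
print(f"FP4:  {fp4 / 1e9:.1f} GB")
print(f"FP4 / BF16 = {fp4 / bf16:.2f}")
```

This prints 26.0 GB for BF16, 13.0 GB for FP8, and 6.5 GB for FP4, a ratio of 0.25. That ratio is an upper bound on the benefit for the quantized tensors, not a measured end-to-end saving.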

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the estimator generalizes, the same differentiable approach might extend FP4 training to architectures beyond the transformers tested here.
Successful FP4 training would make it feasible to run full pre-training on hardware clusters whose accelerators are optimized only for four-bit formats.
The outlier compensation step could be tested independently on activation statistics from models larger than 13B to check whether the same clamping thresholds remain sufficient.
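
A threshold check of the kind suggested in the last point could start from a sketch like this one. It clamps values at a quantile of their absolute magnitudes and keeps the clipped-off part as a sparse residual. The quantile rule, the default `alpha=0.99`, and the function names are assumptions made for illustration; the paper's actual clamping and compensation procedure may differ.

```python
# Illustrative sketch only: clamp outliers at a quantile of the absolute values
# and keep the clipped-off part as a sparse residual for compensation.
# Assumptions: the quantile rule, alpha = 0.99, and all function names.
def quantile_threshold(values, alpha=0.99):
    """Return the magnitude below which roughly a fraction alpha of |values| fall."""
    mags = sorted(abs(v) for v in values)
    idx = min(len(mags) - 1, int(alpha * len(mags)))
    return mags[idx]


def clamp_with_residual(values, alpha=0.99):
    """Split values into a clamped part and a sparse residual.

    clamped[i] + residual[i] reproduces values[i]; the residual is nonzero
    only where an element exceeded the threshold.
    """
    thr = quantile_threshold(values, alpha)
    clamped = [max(-thr, min(thr, v)) for v in values]
    residual = [v - c for v, c in zip(values, clamped)]
    return clamped, residual, thr
```

The clamped part has a bounded range and is the piece that would go through low-precision quantization. The residual is mostly zeros, so it can be handled separately at higher precision at modest cost. Counting nonzero residual entries at a fixed `alpha` across model sizes is one way to see whether the same threshold still isolates only a small set of outliers.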

Load-bearing premise

The estimator plus clamping will keep activations from collapsing and preserve training stability at every model size and data distribution without hidden failure modes.

What would settle it

Train a 70B-parameter model on at least 100B tokens using the FP4 framework and compare final validation loss or downstream task scores against an identical BF16 run; a gap larger than a few percent would falsify the claim of comparable accuracy.
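
The comparison described here comes down to a relative gap between two runs. A minimal sketch, assuming a lower-is-better metric such as validation loss and using a 3% tolerance as a placeholder for "a few percent" (the tolerance and function names are not from the paper):

```python
# Illustrative sketch only: relative gap between an FP4 run and a BF16 baseline.
# Assumption: lower is better (e.g. validation loss); the 3% tolerance stands in
# for "a few percent" and is a placeholder, not a value from the paper.
def relative_gap(fp4_metric, bf16_metric):
    """Fractional amount by which the FP4 metric exceeds the BF16 metric."""
    return (fp4_metric - bf16_metric) / bf16_metric


def is_comparable(fp4_metric, bf16_metric, tolerance=0.03):
    """True when the FP4 run is within the tolerance of the BF16 baseline."""
    return relative_gap(fp4_metric, bf16_metric) <= tolerance
```

For example, hypothetical losses of 2.05 (FP4) and 2.00 (BF16) give a gap of about 2.5% and pass, while 2.20 against 2.00 gives 10% and fails. For higher-is-better metrics such as task accuracy, the sign of the gap would need to be flipped.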

Original abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This paper gives the first working FP4 training pipeline for LLMs up to 13B on 100B tokens that stays close to BF16/FP8 accuracy.

The main takeaway is that they built and tested an FP4 training framework that reaches accuracy comparable to BF16 and FP8 at 13B scale after 100B tokens, with only minimal degradation. This is the first such framework reported for LLM training. The two concrete additions are a differentiable quantization estimator that supports proper weight updates and an outlier clamping plus compensation method to avoid activation collapse. They layer on mixed-precision training and vector-wise quantization to maintain stability through the run.

The full manuscript supplies the experimental validation for exactly this regime, and the stress-test note finds no internal inconsistency or untested assumption required for the stated claim. The techniques directly target the range and error problems that blocked earlier FP4 attempts. The results are presented as direct empirical comparisons rather than fitted identities.

Soft spots are minor and proportionate. The experiments cover the claimed model sizes and token counts but leave open how the same fixes behave at substantially larger scales or on different architectures and data mixes. The forward-looking hardware comment is reasonable but not tested in the work itself.

This paper is for researchers working on low-precision training systems who need concrete methods and scaling numbers. A reader in that group gets usable techniques and evidence they can examine or extend. It deserves a serious referee because the central result is concrete, the methods are specified, and the experiments align with the headline claim without circularity or hidden fitting. I recommend sending it to peer review.

Referee Report

0 major / 2 minor

Summary. The paper introduces the first FP4 training framework for large language models. Key components include a differentiable quantization estimator for weight updates, an outlier clamping and compensation strategy to prevent activation collapse, a mixed-precision training scheme, and vector-wise quantization for stability. The central empirical claim is that this FP4 framework achieves accuracy comparable to BF16 and FP8 baselines with only minimal degradation and scales successfully to 13B-parameter models trained on up to 100B tokens.

Significance. If the experimental results hold under scrutiny, the work is significant as the first demonstration of stable FP4 training for LLMs at this scale. It directly addresses the gap between FP8 feasibility and the challenges of FP4 (quantization error and limited capacity), providing a practical foundation for ultra-low-precision training on next-generation hardware that supports FP4 arithmetic.

minor comments (2)

[Abstract] The abstract states that results demonstrate 'accuracy comparable to BF16 and FP8, with minimal degradation' but supplies no quantitative values (e.g., perplexity deltas or accuracy percentages). Adding one or two concrete numbers would strengthen the summary without altering the manuscript length.
[Section 3 (Method)] The description of the differentiable quantization estimator and the outlier compensation mechanism would benefit from an explicit statement of the gradient approximation used during the backward pass (straight-through estimator or otherwise) to allow exact reproduction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance as the first FP4 training framework at this scale, and recommendation for minor revision. We appreciate the constructive tone and will address any minor points in the revised manuscript.

Circularity Check

0 steps flagged

Empirical result with no circular derivation

The paper introduces an FP4 training framework consisting of a differentiable quantization estimator, outlier clamping/compensation, mixed-precision, and vector-wise quantization, then validates it empirically on LLMs up to 13B parameters trained on 100B tokens. The central claim (comparable accuracy to BF16/FP8) is presented as an experimental outcome rather than a mathematical derivation or prediction that reduces by construction to fitted inputs, self-citations, or definitional identities. No equations, uniqueness theorems, or ansatzes are invoked in a load-bearing way that collapses the result to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details required to audit the ledger are absent from the provided text.

pith-pipeline@v0.9.0 · 5702 in / 1203 out tokens · 34452 ms · 2026-05-23T04:29:31.617506+00:00 · methodology

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pretraining large language models with MXFP4 on Native FP4 Hardware
cs.LG 2026-05 unverdicted novelty 7.0

Weight-gradient quantization drives most convergence problems in MXFP4 pretraining of Llama 3.1-8B; deterministic Hadamard rotations stabilize training by correcting structured micro-scaling errors.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
cs.LG 2025-10 unverdicted novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG 2026-05 unverdicted novelty 6.0

LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG 2026-05 unverdicted novelty 6.0

LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

Pretraining large language models with MXFP4 on Native FP4 Hardware
cs.LG 2026-05 unverdicted novelty 6.0

Weight gradient quantization is the main driver of instability in full-pipeline FP4 LLM training, mitigated by deterministic Hadamard rotations rather than added stochasticity.

Pretraining large language models with MXFP4 on Native FP4 Hardware
cs.LG 2026-05 unverdicted novelty 6.0

Weight gradient FP4 quantization drives LLM pretraining divergence, which deterministic Hadamard rotations can stabilize on native MXFP4 hardware.

Normalized Architectures are Natively 4-Bit
cs.LG 2026-05 conditional novelty 6.0

nGPT's hypersphere constraint makes dot-product signal accumulate constructively under 4-bit quantization while noise averages out, enabling native low-precision training.

Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts
cs.LG 2025-10 unverdicted novelty 5.0

Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.

HiFloat4 Format for Language Model Pre-training on Ascend NPUs
cs.LG 2026-04 unverdicted novelty 4.0

HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.