Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation

Andrei Panferov; Dan Alistarh; Erik Schultheis; Soroush Tabesh

arxiv: 2601.22813 · v2 · pith:MN5WLFNPnew · submitted 2026-01-30 · 💻 cs.LG

classification 💻 cs.LG

keywords training, nvfp4, quartet, estimation, gradient, quantization, quantized, unbiased

The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II .
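
To make the stochastic rounding (SR) baseline mentioned in the abstract concrete, the following is a minimal sketch of unbiased SR onto the FP4 (E2M1) value grid with a per-block scale. It is not MS-EDEN and not the Quartet II scheme, whose details are not given on this page. The function names are hypothetical, and the block scale is kept in full precision here, which is a simplification of a real micro-scaled format.

```python
import random

# Non-negative magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(x, rng):
    """Round x to the E2M1 grid so that the expected result equals x.

    Hypothetical helper; inputs with |x| > 6 are clamped to 6.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    # Find the neighbouring grid points lo <= mag <= hi.
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            break
    # Round up with probability equal to the relative distance from lo,
    # which gives E[result] = lo + p_up * (hi - lo) = mag.
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)


def quantize_block_sr(block, rng):
    """Scale a block so its absmax maps to 6, apply SR, then rescale.

    Assumption: the scale stays in full precision (not quantized).
    """
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / E2M1_GRID[-1]
    return [stochastic_round(v / scale, rng) * scale for v in block]


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(2.3, rng) for _ in range(20000)]
    # Each sample is 2.0 or 3.0; the mean should be close to 2.3.
    print(sum(samples) / len(samples))
```

Averaging many roundings of the same value recovers that value, which is the unbiasedness property the abstract attributes to SR. The price is extra variance in each individual rounding, and that quantization error is what the abstract says MS-EDEN reduces by more than 2x.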

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
cs.LG 2026-05 accept novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
cs.LG 2026-05 unverdicted novelty 7.0

Two randomized Hadamard transforms suffice to make coordinate marginals O(d^{-1/2})-close to Gaussian for most quantization methods, with three needed for vector quantization to match uniform random rotations asymptotically.
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
cs.LG 2026-05 unverdicted novelty 6.0

MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; macro-block scaling, outlier fallback, and adaptive quantization noise recover BF16 accuracy to within 0.7% and 3.0% on tested models.
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
cs.LG 2026-05 unverdicted novelty 6.0

MXFP4 error decomposes into scale bias, deadzone truncation, and grid noise that each dominate distinct RL failure modes, with macro-block scaling, outlier fallback, and adaptive noise recovering or exceeding BF16 per...
Normalized Architectures are Natively 4-Bit
cs.LG 2026-05 conditional novelty 6.0

nGPT's hypersphere constraint makes dot-product signal accumulate constructively under 4-bit quantization while noise averages out, enabling native low-precision training.
HiFloat4 Format for Language Model Pre-training on Ascend NPUs
cs.LG 2026-04 unverdicted novelty 4.0

HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.