TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

Jianfei Chen; Jun Zhu; Martin Rapp; Michael Beyer; Pengle Zhang; Xiaoming Xu; Yifan Liu; Yuxiang Chen

arxiv: 2510.27527 · v3 · submitted 2025-10-31 · 💻 cs.LG · cs.AI

Yuxiang Chen, Yifan Liu, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen

Pith reviewed 2026-05-18 02:08 UTC · model grok-4.3


keywords: NVFP4 · fully quantized training · LLM pre-training · weight oscillation · outlier control · low-precision training · 4-bit quantization · double-block quantization


The pith

TetraJet-v2 enables accurate NVFP4 training for LLMs by suppressing weight oscillation and controlling outliers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TetraJet-v2, a new end-to-end method for 4-bit fully-quantized training of large language models using NVFP4. It identifies weight oscillation and outliers as the two critical issues that limit accuracy at this low precision. The method applies unbiased double-block quantization to linear layers along with two new algorithms: OsciReset to suppress oscillation and OutControl to retain accuracy on outliers. This leads to training performance that closes much of the gap to higher precision BF16 while providing substantial speed improvements over FP8.

Core claim

TetraJet-v2 uses NVFP4 for activations, weights, and gradients in all linear layers with an unbiased double-block quantization method that has optimal convergence. It adds OsciReset as the first effective way to suppress the weight oscillation bottleneck in LLMs and OutControl as a mixed-precision algorithm to keep outlier accuracy. Together these changes allow FP4 pre-training that reduces the performance gap to BF16 by an average of 51.3 percent.
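To make "unbiased" concrete, here is a minimal sketch of stochastic rounding onto the FP4 E2M1 value grid with a single per-block scale. It assumes that unbiasedness is obtained through stochastic rounding, which the abstract does not confirm, and it is not the paper's double-block scheme. The names `stochastic_round` and `quantize_block` are hypothetical, and the scale is kept in full precision for simplicity.

```python
import random

# Non-negative magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(x, rng):
    """Round x to the E2M1 grid so that the expected result equals x (for |x| <= 6)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])  # clamp to the largest representable magnitude
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            # Round up with probability proportional to the distance from lo.
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    return sign * mag  # not reached: the grid covers [0, 6]


def quantize_block(block, rng):
    """Scale a block so its largest magnitude maps to 6, round, then rescale (hypothetical helper)."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    return [stochastic_round(v / scale, rng) * scale for v in block]


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(2.4, rng) for _ in range(20000)]
    # Each sample is 2.0 or 3.0; their average should land close to 2.4.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 2.4 to 2.0 every time, which introduces a systematic bias; the stochastic version trades that bias for variance.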

What carries the argument

Unbiased double-block quantization combined with OsciReset for oscillation suppression and OutControl for outlier control in NVFP4 linear layers.
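Why outliers matter at 4 bits can be shown with a toy block. The sketch below is an illustration of the general problem, not of OutControl itself, whose selection rule and precision choices are not described in the abstract. The block values, the choice to exempt the last element, and the helper names are all hypothetical.

```python
# Non-negative magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Round x to the nearest E2M1 grid value, keeping the sign."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(E2M1_GRID, key=lambda g: abs(g - abs(x)))


def quantize_block_nearest(block):
    """Scale the block so its largest magnitude maps to 6, round, then rescale."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    return [round_nearest(v / scale) * scale for v in block]


def mean_abs_error(ref, approx):
    return sum(abs(r - a) for r, a in zip(ref, approx)) / len(ref)


# Hypothetical block: four small values and one large outlier.
block = [0.1, -0.2, 0.15, 0.05, 48.0]

# Everything in FP4: the outlier sets the scale and the small values collapse to zero.
all_fp4 = quantize_block_nearest(block)

# Mixed precision (assumed policy): keep the outlier as-is, quantize only the rest.
mixed = quantize_block_nearest(block[:-1]) + [block[-1]]

if __name__ == "__main__":
    print("all FP4:", all_fp4, mean_abs_error(block, all_fp4))
    print("mixed:  ", mixed, mean_abs_error(block, mixed))
```

With the outlier in the block, the shared scale is large enough that every small value rounds to zero. Removing it from the 4-bit path lets the remaining values use the grid's full range.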

If this is right

Outperforms prior methods in FP4 pre-training for LLMs up to 370M parameters and 212B tokens.
Reduces the average performance gap to BF16 by 51.3%.
Enables a 1.67x end-to-end speedup over FP8 training.
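The two headline numbers can be read as simple ratios. The sketch below shows one plausible reading, assuming a lower-is-better metric such as validation loss; the abstract does not say which metric is used or how the average is taken, and the input values here are made up purely for illustration.

```python
def gap_reduction(bf16, prior_fp4, new_fp4):
    """Fraction of the prior method's gap to BF16 that the new method closes.

    Assumes a lower-is-better metric (for example, validation loss).
    """
    prior_gap = prior_fp4 - bf16
    new_gap = new_fp4 - bf16
    if prior_gap <= 0:
        raise ValueError("prior method must be worse than BF16 for a gap to exist")
    return 1.0 - new_gap / prior_gap


def time_saved_fraction(speedup):
    """Fraction of wall-clock time saved by a given speedup factor."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Hypothetical losses, not taken from the paper.
    print(round(gap_reduction(bf16=3.00, prior_fp4=3.10, new_fp4=3.05), 3))
    # A 1.67x speedup corresponds to roughly 40% less training time.
    print(round(time_saved_fraction(1.67), 3))
```

Under this reading, a 51.3% reduction means the remaining gap to BF16 is a little under half of what the compared method leaves.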

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the bottlenecks remain the same, the approach may work for models much larger than 370M parameters.
These fixes could make 4-bit training more practical for a wider range of LLM development tasks.
Outlier control and oscillation suppression might apply to other low-bit formats used in training.

Load-bearing premise

Weight oscillation and outliers are the dominant bottlenecks for accurate NVFP4 training and the proposed algorithms fix them without causing new convergence or accuracy problems.
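For readers unfamiliar with the first term, the sketch below illustrates weight oscillation as it is generally understood in quantized training: a high-precision latent weight sits near a rounding threshold, so tiny updates flip its quantized value back and forth. This is an illustration of the phenomenon only. It does not implement OsciReset, whose mechanism the abstract does not describe, and the trajectories and helper names are hypothetical.

```python
# Non-negative magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Round x to the nearest E2M1 grid value, keeping the sign."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(E2M1_GRID, key=lambda g: abs(g - abs(x)))


def count_flips(latent_trajectory):
    """Count how often the quantized value changes between consecutive steps."""
    q = [round_nearest(w) for w in latent_trajectory]
    return sum(1 for a, b in zip(q, q[1:]) if a != b)


# Latent weight hovering around 2.5, the threshold between grid values 2 and 3.
near_threshold = [2.5 + (0.01 if t % 2 == 0 else -0.01) for t in range(10)]

# Same tiny updates, but far from any threshold.
far_from_threshold = [2.2 + (0.01 if t % 2 == 0 else -0.01) for t in range(10)]

if __name__ == "__main__":
    print(count_flips(near_threshold))      # 9: the quantized weight flips every step
    print(count_flips(far_from_threshold))  # 0: the quantized weight stays at 2.0
```

The latent weight moves by the same 0.02 in both cases, yet near the threshold the quantized weight jumps by a full grid step each time, which is the kind of instability the premise refers to.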

What would settle it

Train an LLM with TetraJet-v2 and measure the accuracy gap to BF16. If the gap is not reduced by roughly half relative to previous FP4 methods, the central claim does not hold.

Original abstract

Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress LLMs' weight oscillation bottleneck, and 3) OutControl, a mix-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling an 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract describes three new pieces for end-to-end NVFP4 training that reportedly close half the gap to BF16, but without methods or results details the evidence is still thin.

Hi, the main thing to know is that TetraJet-v2 claims to fix two bottlenecks in 4-bit LLM pretraining, weight oscillation and outliers, through an unbiased double-block quantization scheme, a new OsciReset procedure, and OutControl for mixed-precision outlier handling. They report that this combination reduces the performance gap to BF16 by 51.3 percent on average and delivers a 1.67 times end-to-end speedup over FP8, tested on models up to 370M parameters trained on as many as 212B tokens. They also release the code, which is the most concrete thing we have so far. That release and the focus on full 4-bit for activations, weights, and gradients in every linear layer are the parts that actually move the needle beyond earlier partial-quantization papers.

The soft spots are straightforward from the abstract alone. There are no ablations, no error bars, no convergence plots, and no description of how OsciReset actually works or whether it remains stable at larger scales. The claim that these two issues are the dominant blockers is plausible, but we cannot yet tell if the fixes are robust or if they were tuned to the specific runs shown.

This paper is mainly for people already building or extending low-precision training stacks. Someone who needs practical 4-bit methods could pull the repo and test the numbers on their own hardware or models. I would send it to peer review. The claims are specific enough that referees can check the experiments directly and ask for the missing controls.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces TetraJet-v2, an end-to-end 4-bit fully-quantized training (FQT) method for large language models that applies NVFP4 to activations, weights, and gradients in all linear layers. It identifies weight oscillation and outliers as the primary bottlenecks and proposes three fixes: an unbiased double-block quantization scheme, the OsciReset algorithm to suppress oscillation, and the OutControl mixed-precision method to handle outliers. The central empirical claim is that TetraJet-v2 reduces the performance gap to BF16 by an average of 51.3% while delivering a 1.67x end-to-end speedup over FP8, demonstrated on models up to 370M parameters trained on up to 212B tokens. Code is stated to be publicly available.

Significance. If the reported accuracy and speedup results hold under rigorous evaluation, the work would represent a meaningful step toward practical low-precision training of LLMs, potentially lowering the resource barrier for pre-training while preserving most of the quality of higher-precision baselines. The explicit release of code is a constructive element that could facilitate independent verification.

major comments (1)

Abstract: the central performance claims (51.3% average gap reduction to BF16 and 1.67x speedup over FP8) are presented without any experimental details, error bars, baseline comparisons, ablation studies, or description of the precise training setups, which are load-bearing for substantiating the empirical superiority asserted in the abstract.

minor comments (1)

Abstract: the description of the three proposed components (double-block quantization, OsciReset, OutControl) is high-level; a short sentence outlining their core mechanisms would improve immediate clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential impact of TetraJet-v2 toward practical low-precision LLM training. We address the single major comment below.

Referee: Abstract: the central performance claims (51.3% average gap reduction to BF16 and 1.67x speedup over FP8) are presented without any experimental details, error bars, baseline comparisons, ablation studies, or description of the precise training setups, which are load-bearing for substantiating the empirical superiority asserted in the abstract.

Authors: We acknowledge that the abstract is intentionally concise and therefore omits the full experimental details. These details are provided in the Experiments section of the manuscript. They include error bars from multiple runs, direct comparisons to BF16 and prior FP4/FP8 baselines, ablation studies isolating the contributions of unbiased double-block quantization, OsciReset, and OutControl, and the precise setups (models up to 370M parameters, training on up to 212B tokens). To directly address the concern, we will revise the abstract to include a brief reference to the evaluation scale (e.g., “demonstrated on models up to 370M parameters trained on up to 212B tokens”) while preserving its summary character. This revision will make the central claims more self-contained without exceeding standard abstract length limits.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The abstract presents new algorithmic proposals (unbiased double-block quantization, OsciReset for oscillation suppression, and OutControl for outlier handling) along with empirical performance claims on models up to 370M parameters. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the text. All load-bearing elements are forward-looking algorithmic contributions evaluated against external BF16 and FP8 baselines, rendering the reported results self-contained without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so a complete ledger cannot be constructed. The approach implicitly relies on domain assumptions about NVFP4 compatibility and the effectiveness of the new control algorithms; no explicit free parameters or invented entities are named in the provided text.

axioms (1)

Domain assumption: Weight oscillation and outliers are the primary obstacles to accurate end-to-end NVFP4 training of LLMs.
The abstract states these two issues as the critical problems that the new methods are designed to solve.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
cs.LG · 2026-04 · unverdicted · novelty 5.0

OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...