arxiv: 2510.27527 · v3 · submitted 2025-10-31 · 💻 cs.LG · cs.AI

TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

Pith reviewed 2026-05-18 02:08 UTC · model grok-4.3

keywords NVFP4 · fully quantized training · LLM pre-training · weight oscillation · outlier control · low-precision training · 4-bit quantization · double-block quantization

The pith

TetraJet-v2 enables accurate NVFP4 training for LLMs by suppressing weight oscillation and controlling outliers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TetraJet-v2, an end-to-end method for 4-bit fully quantized training (FQT) of large language models in the NVFP4 format. It identifies weight oscillation and outliers as the two critical issues that limit accuracy at this precision. The method applies unbiased double-block quantization to linear layers and adds two algorithms: OsciReset, which suppresses oscillation, and OutControl, which retains accuracy on outliers. The reported result is FP4 pre-training that reduces the performance gap to BF16 by an average of 51.3% while running 1.67x faster end to end than FP8.

Core claim

TetraJet-v2 uses NVFP4 for activations, weights, and gradients in all linear layers, with an unbiased double-block quantization method that the authors describe as having practically optimal convergence in LLM training. It adds OsciReset, presented as the first effective algorithm for suppressing the weight oscillation bottleneck in LLMs, and OutControl, a mixed-precision algorithm that retains outlier accuracy. Together these changes yield FP4 pre-training that reduces the performance gap to BF16 by an average of 51.3%.
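
One way to picture the "unbiased" part of this claim is stochastic rounding onto the 4-bit E2M1 value grid inside a scaled block. The sketch below is an illustration under stated assumptions, not the paper's double-block scheme: it uses a single per-block scale kept in ordinary Python floats, and the names `stochastic_round` and `quantize_block` are hypothetical.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float; the sign is handled separately.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(x, rng):
    """Round x onto the E2M1 grid so that the expected result equals x.

    Inputs with magnitude above 6 are clamped to 6 first (clamping is biased).
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            # Round up with probability proportional to the distance from lo.
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    raise ValueError("magnitude outside the grid")  # not reachable after clamping


def quantize_block(block, rng):
    """Assumed single-level scheme: map the block's max magnitude to 6, round, rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    scale = amax / E2M1_GRID[-1]
    return [stochastic_round(v / scale, rng) * scale for v in block]


rng = random.Random(0)
samples = [stochastic_round(2.4, rng) for _ in range(20000)]
# Every sample is 2.0 or 3.0, yet the average lands close to 2.4.
mean = sum(samples) / len(samples)
```

Averaging many roundings of the same input recovers that input in expectation, which is the property an unbiased gradient estimate relies on. Round-to-nearest would instead map 2.4 to 2.0 every time.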

What carries the argument

Unbiased double-block quantization combined with OsciReset for oscillation suppression and OutControl for outlier control in NVFP4 linear layers.
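
The text here does not say how OutControl decides what to keep in higher precision. As a generic illustration of why separating outliers helps a block-scaled 4-bit format, the sketch below pulls the largest-magnitude columns out of a small matrix and compares the shared scale before and after. The per-column selection rule and the helper names `split_outlier_columns` and `block_scale` are assumptions made for this example only.

```python
E2M1_MAX = 6.0  # largest magnitude representable in E2M1


def split_outlier_columns(matrix, k):
    """Return (outlier, regular) column indices; outliers are the k columns
    with the largest absolute value. This rule is illustrative, not the paper's."""
    n_cols = len(matrix[0])
    col_amax = [max(abs(row[j]) for row in matrix) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: col_amax[j], reverse=True)
    return sorted(ranked[:k]), sorted(ranked[k:])


def block_scale(values):
    """Scale that maps the largest magnitude in `values` onto E2M1_MAX."""
    return max(abs(v) for v in values) / E2M1_MAX


# Hypothetical activations: column 1 is an outlier channel.
acts = [[0.1, 50.0, -0.2],
        [0.3, -40.0, 0.1]]

outliers, regular = split_outlier_columns(acts, k=1)
scale_all = block_scale([v for row in acts for v in row])
scale_regular = block_scale([row[j] for row in acts for j in regular])
# scale_regular is far smaller than scale_all, so the remaining values
# are quantized on a much finer grid once the outlier column is set aside.
```

With the outlier column included, the shared scale is set by the value 50 and the small entries all collapse toward zero on the 4-bit grid. Handling that column separately lets the remaining entries use the full range.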

If this is right

  • Outperforms prior methods in FP4 pre-training on LLMs of up to 370M parameters trained on up to 212B tokens.
  • Reduces the average performance gap to BF16 by 51.3%.
  • Enables a 1.67x end-to-end speedup over FP8 training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the bottlenecks remain the same, the approach may work for models much larger than 370M parameters.
  • These fixes could make 4-bit training more practical for a wider range of LLM development tasks.
  • Outlier control and oscillation suppression might apply to other low-bit formats used in training.

Load-bearing premise

Weight oscillation and outliers are the dominant bottlenecks for accurate NVFP4 training and the proposed algorithms fix them without causing new convergence or accuracy problems.
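
Weight oscillation, in the general quantized-training sense, is when a high-precision weight hovers near a rounding threshold so that its quantized value keeps flipping while the weight itself barely moves. The toy below shows that phenomenon on the E2M1 grid with round-to-nearest; it does not implement OsciReset, and `round_nearest` and `count_flips` are hypothetical helper names.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Round x to the closest E2M1 magnitude, keeping its sign."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(E2M1_GRID, key=lambda g: abs(g - abs(x)))


def count_flips(start, updates):
    """Apply updates to a latent weight and count changes in its quantized value."""
    w = start
    q_prev = round_nearest(w)
    flips = 0
    for u in updates:
        w += u
        q = round_nearest(w)
        if q != q_prev:
            flips += 1
        q_prev = q
    return flips, w


# Small alternating updates, as a stand-in for noisy gradient steps.
updates = [0.02, -0.02] * 5

# 2.49 sits just below the 2.0 / 3.0 rounding threshold at 2.5.
flips_near, w_near = count_flips(2.49, updates)
# 2.2 is well inside the bin that rounds to 2.0.
flips_far, w_far = count_flips(2.2, updates)
```

The weight that starts near the threshold changes its quantized value on every step even though it ends where it began, while the weight inside the bin never flips. Each flip here moves the quantized weight by a full grid step of 1.0, fifty times the size of the update that caused it.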

What would settle it

Train an LLM with TetraJet-v2 and with prior FP4 methods under the same setup. If TetraJet-v2 does not reduce the gap to BF16 by roughly half relative to those methods, the central claim fails.
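
The check can be written as a small calculation. The sketch below assumes a lower-is-better metric such as validation loss and defines gap reduction as the fraction of the baseline's gap to BF16 that the new method closes. The paper's exact definition of its averaged gap is not given in this text, the function name is hypothetical, and the numbers are made up for illustration.

```python
def gap_reduction(bf16, baseline_fp4, method_fp4):
    """Fraction of the baseline's gap to BF16 closed by the method.

    Assumes a lower-is-better metric (e.g. validation loss) and a baseline
    that is strictly worse than BF16.
    """
    baseline_gap = baseline_fp4 - bf16
    method_gap = method_fp4 - bf16
    if baseline_gap <= 0:
        raise ValueError("baseline must be worse than BF16 for the ratio to make sense")
    return 1.0 - method_gap / baseline_gap


# Hypothetical losses, not taken from the paper.
reduction = gap_reduction(bf16=3.00, baseline_fp4=3.10, method_fp4=3.05)
# reduction is about 0.5: the method leaves half of the baseline's gap.
```

A value near 0.5 on each evaluated configuration would be consistent with the reported 51.3% average; values near zero would not.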

Original abstract

Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress LLMs' weight oscillation bottleneck, and 3) OutControl, a mix-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling an 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces TetraJet-v2, an end-to-end 4-bit fully-quantized training (FQT) method for large language models that applies NVFP4 to activations, weights, and gradients in all linear layers. It identifies weight oscillation and outliers as the primary bottlenecks and proposes three fixes: an unbiased double-block quantization scheme, the OsciReset algorithm to suppress oscillation, and the OutControl mixed-precision method to handle outliers. The central empirical claim is that TetraJet-v2 reduces the performance gap to BF16 by an average of 51.3% while delivering a 1.67x end-to-end speedup over FP8, demonstrated on models up to 370M parameters trained on up to 212B tokens. Code is stated to be publicly available.

Significance. If the reported accuracy and speedup results hold under rigorous evaluation, the work would represent a meaningful step toward practical low-precision training of LLMs, potentially lowering the resource barrier for pre-training while preserving most of the quality of higher-precision baselines. The explicit release of code is a constructive element that could facilitate independent verification.

major comments (1)
  1. Abstract: the central performance claims (51.3% average gap reduction to BF16 and 1.67x speedup over FP8) are presented without any experimental details, error bars, baseline comparisons, ablation studies, or description of the precise training setups, which are load-bearing for substantiating the empirical superiority asserted in the abstract.
minor comments (1)
  1. Abstract: the description of the three proposed components (double-block quantization, OsciReset, OutControl) is high-level; a short sentence outlining their core mechanisms would improve immediate clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential impact of TetraJet-v2 toward practical low-precision LLM training. We address the single major comment below.

point-by-point responses
  1. Referee: Abstract: the central performance claims (51.3% average gap reduction to BF16 and 1.67x speedup over FP8) are presented without any experimental details, error bars, baseline comparisons, ablation studies, or description of the precise training setups, which are load-bearing for substantiating the empirical superiority asserted in the abstract.

    Authors: We acknowledge that the abstract is intentionally concise and therefore omits the full experimental details. These details are provided in the Experiments section of the manuscript: error bars from multiple runs, direct comparisons to BF16 and prior FP4/FP8 baselines, ablation studies isolating the contributions of unbiased double-block quantization, OsciReset, and OutControl, and the precise setups (models up to 370M parameters, training on up to 212B tokens). To directly address the concern, we will revise the abstract to include a brief reference to the evaluation scale (e.g., “demonstrated on models up to 370M parameters trained on up to 212B tokens”) while preserving its summary character. This revision will make the central claims more self-contained without exceeding standard abstract length limits.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The abstract presents new algorithmic proposals (unbiased double-block quantization, OsciReset for oscillation suppression, and OutControl for outlier handling) along with empirical performance claims on models up to 370M parameters. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the text. All load-bearing elements are forward-looking algorithmic contributions evaluated against external BF16 and FP8 baselines, rendering the reported results self-contained without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so a complete ledger cannot be constructed. The approach implicitly relies on domain assumptions about NVFP4 compatibility and the effectiveness of the new control algorithms; no explicit free parameters or invented entities are named in the provided text.

axioms (1)
  • Domain assumption: weight oscillation and outliers are the primary obstacles to accurate end-to-end NVFP4 training of LLMs.
    The abstract states these two issues as the critical problems that the new methods are designed to solve.

pith-pipeline@v0.9.0 · 5759 in / 1470 out tokens · 76568 ms · 2026-05-18T02:08:05.474491+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...