TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Pith reviewed 2026-05-18 02:08 UTC · model grok-4.3
The pith
TetraJet-v2 enables accurate NVFP4 training for LLMs by suppressing weight oscillation and controlling outliers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TetraJet-v2 uses NVFP4 for activations, weights, and gradients in all linear layers, with an unbiased double-block quantization method that the authors describe as having practically optimal convergence in LLM training. It adds OsciReset, presented as the first effective algorithm for suppressing the weight-oscillation bottleneck in LLMs, and OutControl, a mixed-precision algorithm that retains outlier accuracy. Together these changes allow FP4 pre-training that reduces the performance gap to BF16 by an average of 51.3%.
What carries the argument
Unbiased double-block quantization combined with OsciReset for oscillation suppression and OutControl for outlier control in NVFP4 linear layers.
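To make "unbiased double-block quantization" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes the usual NVFP4 layout (E2M1 element values, one scale per 16-element block, one scale per tensor) and uses stochastic rounding as the source of unbiasedness. The function names are hypothetical, and the block scale is kept in full precision rather than rounded to FP8 as real hardware would do.

```python
import random

# Non-negative magnitudes representable in FP4 E2M1 (the sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumption: one shared scale per 16 consecutive elements


def stochastic_round_to_grid(x, rng):
    """Round x (clamped to |x| <= 6) to a neighbouring grid point so that E[result] == x."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            # Round up with probability proportional to the distance from lo.
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    return sign * E2M1_GRID[-1]


def quantize_double_block(values, rng):
    """Quantize and dequantize with a per-tensor scale and a per-block scale."""
    tensor_amax = max(abs(v) for v in values)
    tensor_scale = tensor_amax / E2M1_GRID[-1] if tensor_amax > 0 else 1.0
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Block scale relative to the tensor scale (hypothetical simplification:
        # stored in full precision here, not in FP8).
        block_scale = block_amax / (E2M1_GRID[-1] * tensor_scale) if block_amax > 0 else 1.0
        step = tensor_scale * block_scale
        out.extend(stochastic_round_to_grid(v / step, rng) * step for v in block)
    return out
```

Averaging many independent calls on the same input should approach the input itself, which is the property "unbiased" refers to. A round-to-nearest variant would not have it.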
If this is right
- Outperforms prior methods in FP4 pre-training for LLMs up to 370M parameters and 212B tokens.
- Reduces the average performance gap to BF16 by 51.3%.
- Enables a 1.67x end-to-end speedup over FP8 training.
Where Pith is reading between the lines
- If the bottlenecks remain the same, the approach may work for models much larger than 370M parameters.
- These fixes could make 4-bit training more practical for a wider range of LLM development tasks.
- Outlier control and oscillation suppression might apply to other low-bit formats used in training.
Load-bearing premise
Weight oscillation and outliers are the dominant bottlenecks for accurate NVFP4 training and the proposed algorithms fix them without causing new convergence or accuracy problems.
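The summary does not define weight oscillation. One common reading, assumed here, is that a high-precision master weight sitting near a rounding boundary keeps flipping between two adjacent quantized values even though it barely moves. The sketch below illustrates that effect with round-to-nearest on the E2M1 grid. The helper names and trajectories are hypothetical, and this is not the OsciReset algorithm, whose mechanism the summary does not describe.

```python
# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_nearest(x):
    """Deterministic rounding of x to the closest E2M1 value."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(E2M1_GRID, key=lambda g: abs(g - abs(x)))


def count_flips(master_trajectory):
    """Count how often the quantized weight changes between consecutive steps."""
    quantized = [round_to_nearest(w) for w in master_trajectory]
    return sum(1 for a, b in zip(quantized, quantized[1:]) if a != b)


# Hypothetical trajectories: tiny alternating updates of +/-0.01.
near_boundary = [1.25 + 0.01 * (-1) ** t for t in range(10)]      # straddles the 1.0 / 1.5 midpoint
far_from_boundary = [1.05 + 0.01 * (-1) ** t for t in range(10)]  # stays close to 1.0

print(count_flips(near_boundary))      # 9: the quantized value flips at every step
print(count_flips(far_from_boundary))  # 0: the quantized value never changes
```

Both trajectories move the master weight by the same amount, but only the one near the boundary changes what the forward pass sees.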
What would settle it
Train an LLM with TetraJet-v2 and measure whether the accuracy gap to BF16 shrinks by roughly half relative to previous FP4 methods. A clearly smaller reduction under a comparable setup would undercut the central claim.
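The check above reduces to simple arithmetic. The summary does not say which metric the 51.3% figure is averaged over, so the sketch below assumes a lower-is-better metric such as validation loss or perplexity. The function name and the numbers in the example are hypothetical.

```python
def gap_reduction(bf16, prior_fp4, new_fp4):
    """Fraction of the prior method's gap to BF16 that the new method closes.

    All three arguments are values of the same lower-is-better metric.
    """
    prior_gap = prior_fp4 - bf16
    new_gap = new_fp4 - bf16
    if prior_gap == 0:
        raise ValueError("prior method already matches BF16; reduction is undefined")
    return 1.0 - new_gap / prior_gap


# Hypothetical numbers: BF16 at 20.0, a prior FP4 method at 21.0, a new method at 20.5.
print(gap_reduction(20.0, 21.0, 20.5))  # 0.5, i.e. half of the gap is closed
```

A reported average of 51.3% would correspond to averaging this quantity over the evaluated models and settings.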
Original abstract
Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress LLMs' weight oscillation bottleneck, and 3) OutControl, a mix-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling an 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TetraJet-v2, an end-to-end 4-bit fully-quantized training (FQT) method for large language models that applies NVFP4 to activations, weights, and gradients in all linear layers. It identifies weight oscillation and outliers as the primary bottlenecks and proposes three fixes: an unbiased double-block quantization scheme, the OsciReset algorithm to suppress oscillation, and the OutControl mixed-precision method to handle outliers. The central empirical claim is that TetraJet-v2 reduces the performance gap to BF16 by an average of 51.3% while delivering a 1.67x end-to-end speedup over FP8, demonstrated on models up to 370M parameters trained on up to 212B tokens. Code is stated to be publicly available.
Significance. If the reported accuracy and speedup results hold under rigorous evaluation, the work would represent a meaningful step toward practical low-precision training of LLMs, potentially lowering the resource barrier for pre-training while preserving most of the quality of higher-precision baselines. The explicit release of code is a constructive element that could facilitate independent verification.
major comments (1)
- Abstract: the central performance claims (51.3% average gap reduction to BF16 and 1.67x speedup over FP8) are presented without any experimental details, error bars, baseline comparisons, ablation studies, or description of the precise training setups, which are load-bearing for substantiating the empirical superiority asserted in the abstract.
minor comments (1)
- Abstract: the description of the three proposed components (double-block quantization, OsciReset, OutControl) is high-level; a short sentence outlining their core mechanisms would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential impact of TetraJet-v2 toward practical low-precision LLM training. We address the single major comment below.
Point-by-point responses
- Referee: Abstract: the central performance claims (51.3% average gap reduction to BF16 and 1.67x speedup over FP8) are presented without any experimental details, error bars, baseline comparisons, ablation studies, or description of the precise training setups, which are load-bearing for substantiating the empirical superiority asserted in the abstract.
Authors: We acknowledge that the abstract is intentionally concise and therefore omits the full experimental details. These details are provided in the Experiments section of the manuscript: error bars from multiple runs, direct comparisons to BF16 and prior FP4/FP8 baselines, ablation studies isolating the contributions of unbiased double-block quantization, OsciReset, and OutControl, and the precise setups (models up to 370M parameters, training on up to 212B tokens). To directly address the concern, we will revise the abstract to include a brief reference to the evaluation scale (e.g., “demonstrated on models up to 370M parameters trained on up to 212B tokens”) while preserving its summary character. This revision will make the central claims more self-contained without exceeding standard abstract length limits.
Revision: yes
Circularity Check
No significant circularity detected
The abstract presents new algorithmic proposals (unbiased double-block quantization, OsciReset for oscillation suppression, and OutControl for outlier handling) along with empirical performance claims on models up to 370M parameters. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the text. All load-bearing elements are forward-looking algorithmic contributions evaluated against external BF16 and FP8 baselines, rendering the reported results self-contained without reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Weight oscillation and outliers are the primary obstacles to accurate end-to-end NVFP4 training of LLMs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...