Pretraining large language models with MXFP4 on Native FP4 Hardware

Mahmut Taylan Kandemir; Meena Arunachalam; Miro Hodak; Musa Cim; Poovaiah Palangappa; Ravi Dwivedula

arxiv: 2605.09825 · v3 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir

Pith reviewed 2026-05-15 05:12 UTC · model grok-4.3

keywords MXFP4 quantization · FP4 training · weight gradient · LLM pretraining · Hadamard rotation · transformer optimization · native hardware

The pith

Quantizing weight gradients to MXFP4 causes most FP4 training instability in large language models, while forward passes and activation gradients tolerate it with modest extra tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the sources of divergence in full-pipeline FP4 training by turning on MXFP4 quantization one stage at a time in transformer pretraining. It finds that weight-gradient quantization is the dominant source of instability, while FP4 in forward propagation and activation gradients adds only small token overhead. Deterministic Hadamard rotations restore stable convergence when weight gradients are quantized, but stochastic rounding and randomized rotations do not. These controlled results point to structured scaling errors along gradient paths rather than insufficient randomness as the root cause. The experiments run directly on native MXFP4 hardware, removing emulation artifacts.
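To make the format concrete, here is a minimal NumPy sketch of MXFP4 fake quantization as described in the OCP Microscaling specification: 32 elements share one power-of-two scale, and each element is stored as FP4 E2M1. The function name `mxfp4_quantize`, the nearest-value rounding with first-match tie-breaking, and the saturating clip are illustrative assumptions; the paper's native hardware kernels are not shown in this summary and may differ.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1 (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32      # elements sharing one scale in MXFP4
ELEM_EMAX = 2   # exponent of the largest E2M1 value: 6 = 1.5 * 2**2

def mxfp4_quantize(x: np.ndarray) -> np.ndarray:
    """Fake-quantize an array whose size is divisible by 32 to MXFP4 and back."""
    blocks = x.reshape(-1, BLOCK).astype(np.float64)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared power-of-two scale per block; all-zero blocks fall back to scale 2**-2.
    safe = np.where(amax > 0, amax, 1.0)
    scale = 2.0 ** (np.floor(np.log2(safe)) - ELEM_EMAX)
    # Scaled magnitudes above 6 saturate (assumed clamp behaviour).
    scaled = np.clip(np.abs(blocks) / scale, 0.0, FP4_GRID[-1])
    # Round each magnitude to the nearest grid point (ties go to the smaller one).
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)
```

For example, a block containing one value of 96 and thirty-one values of 1.0 gets a shared scale of 16, so every 1.0 is flushed to zero. This is one simple way a single large entry can produce a structured, block-wide error of the kind the paper attributes to micro-scaling.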

Core claim

In end-to-end pretraining of Llama 3.1-8B on C4, progressively enabling MXFP4 shows that Wgrad quantization drives convergence degradation, whereas Fprop and Dgrad alone require only modest additional tokens. Deterministic Hadamard rotations stabilize optimization once Wgrad is quantized, whereas stochastic rounding and randomized rotations do not. This indicates that instability arises from structured micro-scaling errors in sensitive gradient paths.
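The stabilizing effect of a fixed rotation can be illustrated with a short NumPy sketch. It assumes that "H16" denotes a 16×16 Hadamard matrix applied block-wise along the GEMM reduction dimension of both operands; this summary does not state exactly where the paper inserts the rotation, so `rotate_blocks` and the toy shapes are hypothetical.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def rotate_blocks(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Apply h independently to each contiguous block along the last axis."""
    n = h.shape[0]
    return (x.reshape(*x.shape[:-1], -1, n) @ h).reshape(x.shape)

H16 = hadamard(16)
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64))
b = rng.standard_normal((8, 64))
a[0, 3] = 50.0  # a single outlier along the reduction dimension

a_rot, b_rot = rotate_blocks(a, H16), rotate_blocks(b, H16)
# The same fixed rotation on both operands cancels in the product: (aH)(bH)^T = a b^T.
exact_match = np.allclose(a_rot @ b_rot.T, a @ b.T)
# The outlier's energy is spread across its 16-element block.
peak_before, peak_after = np.abs(a[0]).max(), np.abs(a_rot[0]).max()
```

Because the rotation is orthogonal, the unquantized product is unchanged, while the peak magnitude in the outlier's block drops. A smaller peak lets the shared block scale resolve the remaining values more finely once quantization is applied after the rotation.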

What carries the argument

Progressive stage-wise enabling of MXFP4 quantization across Fprop, Dgrad, and Wgrad, paired with deterministic versus stochastic interventions such as Hadamard rotations.
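The stage-wise design maps onto the three GEMMs of a linear layer. The sketch below is an assumption-laden illustration, not the authors' training code: `StageConfig`, `coarse_round`, and the schedule names are hypothetical, and `coarse_round` is only a stand-in for a real MXFP4 quantizer.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

Quantizer = Callable[[np.ndarray], np.ndarray]

def no_quant(t: np.ndarray) -> np.ndarray:
    return t

def coarse_round(t: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for an MXFP4 quantizer: round to multiples of 0.5.
    return np.round(t * 2.0) / 2.0

@dataclass
class StageConfig:
    fprop: Quantizer = no_quant   # applied to inputs of Y = X W^T
    dgrad: Quantizer = no_quant   # applied to inputs of dX = dY W
    wgrad: Quantizer = no_quant   # applied to inputs of dW = dY^T X

def linear_forward(x, w, cfg):
    return cfg.fprop(x) @ cfg.fprop(w).T

def linear_backward(x, w, dy, cfg):
    dx = cfg.dgrad(dy) @ cfg.dgrad(w)      # activation gradient
    dw = cfg.wgrad(dy).T @ cfg.wgrad(x)    # weight gradient
    return dx, dw

# Cumulative schedule: each entry turns on one more low-precision stage.
schedule = {
    "baseline": StageConfig(),
    "fprop": StageConfig(fprop=coarse_round),
    "fprop+dgrad": StageConfig(fprop=coarse_round, dgrad=coarse_round),
    "fprop+dgrad+wgrad": StageConfig(coarse_round, coarse_round, coarse_round),
}
```

Keeping one quantizer per GEMM makes it easy to swap a single stage back to higher precision, or to attach an intervention such as a rotation to the Wgrad path only.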

If this is right

FP4 can be used for forward propagation and activation gradients with limited extra compute.
Weight gradients need explicit stabilization such as deterministic rotations to avoid divergence.
Native hardware MXFP4 support enables precise diagnosis without software-emulation noise.
Instability is driven by structured scaling errors, not by lack of stochasticity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Gradient paths appear more sensitive to micro-scaling errors than forward activations, suggesting selective higher-precision treatment for Wgrad may be sufficient.
The same deterministic rotation technique could be tested on other model scales or datasets to check whether the stabilization generalizes.
Hardware designers could prioritize fast deterministic rotation support in future FP4 accelerators if the pattern holds.

Load-bearing premise

Progressively turning on FP4 in each training stage cleanly separates the contribution of each stage without hidden interactions from the joint optimization.
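One way to see what this premise leaves untested is to compare the cumulative schedule with a full on/off grid over the three stages. The short sketch below enumerates both; the stage labels follow the paper, while treating the higher-precision baseline as the empty configuration is an assumption made for illustration.

```python
from itertools import product

STAGES = ("fprop", "dgrad", "wgrad")

# Cumulative schedule: each step adds one more FP4 stage on top of the previous ones.
progressive = [frozenset(STAGES[:k]) for k in range(len(STAGES) + 1)]

# Full factorial design: every on/off combination of the three stages.
factorial = [
    frozenset(s for s, on in zip(STAGES, mask) if on)
    for mask in product((False, True), repeat=len(STAGES))
]

# Configurations a factorial study would run that the cumulative schedule skips.
missing = [cfg for cfg in factorial if cfg not in progressive]
wgrad_only_covered = frozenset({"wgrad"}) in progressive

for cfg in missing:
    print(sorted(cfg))
```

The cumulative schedule covers four of the eight combinations. The Wgrad-only configuration is among the four it skips, which is the gap the referee report raises.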

What would settle it

Running the same Llama 3.1-8B pretraining with deterministic Hadamard rotations applied to all quantized stages and checking whether final loss matches the BF16 baseline within the same token budget.

Figures

Figures reproduced from arXiv: 2605.09825 by Mahmut Taylan Kandemir, Meena Arunachalam, Miro Hodak, Musa Cim, Poovaiah Palangappa, Ravi Dwivedula.

**Figure 1.** Validation perplexity vs. training tokens for Llama 3.1-8B under MLPerf pretraining on the C4 dataset. We compare FP8, full-pipeline MXFP4 (Fprop + Dgrad + Wgrad; no stabilizer), and full-pipeline MXFP4 + deterministic Hadamard (H16). MXFP4 + deterministic Hadamard closely tracks FP8, while full-pipeline MXFP4 without stabilization converges more slowly and is less stable.

**Figure 3.** Hadamard-transformed MXFP4 architecture for forward and backward passes. Inputs …

**Figure 4.** Comparison of block quantization strategies: 2D (32 …

Original abstract

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
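For readers unfamiliar with the stochastic intervention mentioned in the abstract, the following NumPy sketch shows stochastic rounding onto the FP4 E2M1 grid. It assumes the input has already been divided by its block scale so that magnitudes lie in [0, 6]; `stochastic_round_fp4` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(v: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stochastically round already-scaled values onto the E2M1 grid."""
    mag = np.clip(np.abs(v), 0.0, FP4_GRID[-1])
    # Index of the grid point at or above each magnitude (at least 1, so lo exists).
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="left"), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    # Round up with probability proportional to the distance from the lower neighbour.
    p_up = (mag - lo) / (hi - lo)
    return np.sign(v) * np.where(rng.random(mag.shape) < p_up, hi, lo)

rng = np.random.default_rng(0)
samples = stochastic_round_fp4(np.full(200_000, 2.4), rng)
mean_estimate = samples.mean()  # close to 2.4: the rounding is unbiased in expectation
```

Stochastic rounding removes bias from the per-element rounding step, but it leaves the shared block scale untouched. That is consistent with the abstract's point that added randomness alone does not address structured scaling errors.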

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Wgrad quantization drives the instability in this MXFP4 setup, and deterministic Hadamard rotations fix it where stochastic tricks do not.

The paper's central observation is that in full pretraining of Llama 3.1-8B on C4 with MXFP4, enabling FP4 only on forward propagation and activation gradients adds only modest extra tokens to convergence, but adding weight-gradient quantization is what breaks things. They reach this by progressively turning on the three stages while keeping everything else fixed, then testing interventions. Stochastic rounding and randomized Hadamard rotations do not restore stability once Wgrad is quantized, but deterministic Hadamard rotations do. That differential result is the clearest new empirical point.

Running the whole thing on native MXFP4 hardware on AMD MI355X GPUs is also useful; it removes the usual emulation caveats that plague most low-precision papers. The controlled progressive design is a reasonable way to surface which stage matters most.

The main soft spot is the isolation claim. Because Fprop and Dgrad are already quantized when Wgrad quantization is added, the inputs to the weight-gradient computation are already noisy, so any interaction between those earlier errors and the Wgrad stage is baked in. An ablation that applies MXFP4 only to Wgrad while leaving the forward and activation-gradient paths in higher precision would have made the attribution tighter. The abstract also gives no token counts, variance numbers, or convergence curves, so the size of the effects is hard to judge from the summary alone.

This is the kind of work that matters for groups trying to push FP4 training onto real silicon. It is worth sending to peer review because the hardware-backed experiment and the deterministic-versus-stochastic contrast are concrete enough to be checked and extended.
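The deterministic-versus-randomized contrast in the letter can be made concrete with a small sketch. It assumes the common construction in which a randomized Hadamard rotation is a fixed Hadamard matrix multiplied by a random diagonal of ±1 signs; the summary does not say which randomization the paper uses, so `randomized_hadamard` is a hypothetical illustration.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def randomized_hadamard(n: int, rng: np.random.Generator) -> np.ndarray:
    """diag(s) @ H with s drawn uniformly from {-1, +1}; assumed construction."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n)

H = hadamard(16)                                       # deterministic: same matrix every step
R = randomized_hadamard(16, np.random.default_rng(0))  # changes with the seed

# Both are orthogonal, so both cancel exactly in an unquantized GEMM.
both_orthogonal = np.allclose(H @ H.T, np.eye(16)) and np.allclose(R @ R.T, np.eye(16))
```

Under this assumption the two variants are equally valid rotations in exact arithmetic. Any difference in training behavior therefore has to come from how each interacts with quantization, which is the part the paper's experiments probe.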

Referee Report

1 major / 2 minor

Summary. The manuscript presents a controlled experimental study on the effects of MXFP4 quantization during pretraining of large language models on native FP4 hardware. By progressively enabling FP4 quantization in forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) for Llama 3.1-8B on the C4 dataset, the authors conclude that Wgrad quantization is the main driver of training instability and increased token requirements, while Fprop and Dgrad FP4 have modest effects. They further demonstrate that deterministic Hadamard rotations can restore stability, attributing the issues to structured micro-scaling errors rather than insufficient randomness. Experiments leverage native MXFP4 support on AMD Instinct MI355X GPUs.

Significance. If the findings hold, this work provides valuable insights into the specific sources of instability in low-precision training pipelines, which could guide the development of more robust quantization strategies for efficient LLM pretraining. The use of native hardware support and controlled ablations strengthens the practical relevance. The identification of deterministic rotations as a stabilizing technique is a notable contribution that could be broadly applicable.

major comments (1)

[Progressive FP4 enabling experiments (as described in abstract and methods)] The central attribution that quantizing Wgrad is the primary driver of convergence degradation is based on cumulative progressive enabling (Fprop FP4, then +Dgrad FP4, then +Wgrad FP4). This design lacks an isolated ablation applying MXFP4 only to Wgrad while keeping Fprop and Dgrad in full precision. Without it, synergistic interactions between quantized activations from earlier stages and the Wgrad computation cannot be ruled out, so the claim that instability is driven specifically by 'structured micro-scaling errors along sensitive gradient paths' in Wgrad rests on an unverified isolation assumption.

minor comments (2)

[Abstract] The abstract provides no quantitative metrics, error bars, exact token counts, or convergence curve details, which limits immediate evaluation of effect sizes even though the full manuscript presumably contains them.
[Intervention experiments] Clarify whether the deterministic Hadamard rotations are applied only during Wgrad computation or throughout the pipeline, and report the exact overhead in terms of additional compute or memory.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying a potential gap in our experimental isolation. We address the major comment below and have revised the manuscript to include the requested ablation.

Referee: [Progressive FP4 enabling experiments (as described in abstract and methods)] The central attribution that quantizing Wgrad is the primary driver of convergence degradation is based on cumulative progressive enabling (Fprop FP4, then +Dgrad FP4, then +Wgrad FP4). This design lacks an isolated ablation applying MXFP4 only to Wgrad while keeping Fprop and Dgrad in full precision. Without it, synergistic interactions between quantized activations from earlier stages and the Wgrad computation cannot be ruled out, so the claim that instability is driven specifically by 'structured micro-scaling errors along sensitive gradient paths' in Wgrad rests on an unverified isolation assumption.

Authors: We thank the referee for this observation. Our progressive enabling design was intended to isolate the incremental impact of each quantization stage under otherwise fixed conditions, and the data show that Fprop and Dgrad FP4 produce only modest token increases while the addition of Wgrad FP4 triggers the primary degradation. This incremental pattern supports our attribution to Wgrad. Nevertheless, we agree that an isolated Wgrad-only ablation is required to fully exclude synergistic interactions with prior-stage quantization. In the revised manuscript we have added this experiment (MXFP4 applied exclusively to Wgrad with Fprop and Dgrad in full precision) on the same Llama 3.1-8B / C4 setup; the new results confirm that Wgrad quantization alone reproduces the observed instability, reinforcing the interpretation of structured micro-scaling errors along gradient paths.

revision: yes

Circularity Check

0 steps flagged

No circularity in experimental ablation results

The paper's findings derive from direct empirical ablations on native MXFP4 hardware, progressively enabling FP4 in Fprop, Dgrad, and Wgrad while measuring convergence on Llama 3.1-8B pretraining. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the reported results. The attribution of degradation primarily to Wgrad follows from the controlled comparisons without reducing to the inputs by construction. No self-citation load-bearing steps or ansatz smuggling are present in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is presented as an empirical controlled study.

pith-pipeline@v0.9.0 · 5526 in / 1061 out tokens · 57296 ms · 2026-05-15T05:12:36.108870+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

[1] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

[2] Optimizing large language model training using fp4 quantization. arXiv preprint arXiv:2501.17116.

[3] Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023.

[4] Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems.

[5] AMD Instinct™ MI355X GPUs.

[6] Fp4 all the way: Fully quantized training of llms. arXiv preprint arXiv:2505.19115, 2025.

[7] Llm-fp4: 4-bit floating-point quantized transformers. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[8] Towards efficient pre-training: Exploring fp4 precision in large language models. arXiv preprint arXiv:2502.11458.

[9] Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling FP4 quantization. arXiv preprint arXiv:2509.23202.

[10] Pretraining large language models with nvfp4. arXiv preprint arXiv:2509.25149, 2025.

[11] Quartet: Native fp4 training can be optimal for large language models. Advances in Neural Information Processing Systems.

[12] Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4. arXiv preprint arXiv:2603.08747.
