Pretraining large language models with MXFP4 on Native FP4 Hardware
Pith reviewed 2026-05-15 05:12 UTC · model grok-4.3
The pith
Quantizing weight gradients to MXFP4 causes most FP4 training instability in large language models, while forward passes and activation gradients tolerate it with modest extra tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In end-to-end pretraining of Llama 3.1-8B on C4, progressively enabling MXFP4 shows that Wgrad quantization drives convergence degradation, whereas Fprop and Dgrad alone require only modest additional tokens. Deterministic Hadamard rotations stabilize optimization once Wgrad is quantized, whereas stochastic rounding and randomized rotations do not. This indicates that instability arises from structured micro-scaling errors in sensitive gradient paths.
What carries the argument
Progressive stage-wise enabling of MXFP4 quantization across Fprop, Dgrad, and Wgrad, paired with deterministic versus stochastic interventions such as Hadamard rotations.
If this is right
- FP4 can be used for forward propagation and activation gradients with limited extra compute.
- Weight gradients need explicit stabilization such as deterministic rotations to avoid divergence.
- Native hardware MXFP4 support enables precise diagnosis without software-emulation noise.
- Instability is driven by structured scaling errors, not by lack of stochasticity.
Where Pith is reading between the lines
- Gradient paths appear more sensitive to micro-scaling errors than forward activations, suggesting selective higher-precision treatment for Wgrad may be sufficient.
- The same deterministic rotation technique could be tested on other model scales or datasets to check whether the stabilization generalizes.
- Hardware designers could prioritize fast deterministic rotation support in future FP4 accelerators if the pattern holds.
Load-bearing premise
Progressively turning on FP4 in each training stage cleanly separates the contribution of each stage without hidden interactions from the joint optimization.
What would settle it
Running the same Llama 3.1-8B pretraining with deterministic Hadamard rotations applied to all quantized stages and checking whether final loss matches the BF16 baseline within the same token budget.
Figures
read the original abstract
Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a controlled experimental study on the effects of MXFP4 quantization during pretraining of large language models on native FP4 hardware. By progressively enabling FP4 quantization in forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) for Llama 3.1-8B on the C4 dataset, the authors conclude that Wgrad quantization is the main driver of training instability and increased token requirements, while Fprop and Dgrad FP4 have modest effects. They further demonstrate that deterministic Hadamard rotations can restore stability, attributing the issues to structured micro-scaling errors rather than insufficient randomness. Experiments leverage native MXFP4 support on AMD Instinct MI355X GPUs.
Significance. If the findings hold, this work provides valuable insights into the specific sources of instability in low-precision training pipelines, which could guide the development of more robust quantization strategies for efficient LLM pretraining. The use of native hardware support and controlled ablations strengthens the practical relevance. The identification of deterministic rotations as a stabilizing technique is a notable contribution that could be broadly applicable.
major comments (1)
- [Progressive FP4 enabling experiments (as described in abstract and methods)] The central attribution that quantizing Wgrad is the primary driver of convergence degradation is based on cumulative progressive enabling (Fprop FP4, then +Dgrad FP4, then +Wgrad FP4). This design lacks an isolated ablation applying MXFP4 only to Wgrad while keeping Fprop and Dgrad in full precision. Without it, synergistic interactions between quantized activations from earlier stages and the Wgrad computation cannot be ruled out, so the claim that instability is driven specifically by 'structured micro-scaling errors along sensitive gradient paths' in Wgrad rests on an unverified isolation assumption.
minor comments (2)
- [Abstract] The abstract provides no quantitative metrics, error bars, exact token counts, or convergence curve details, which limits immediate evaluation of effect sizes even though the full manuscript presumably contains them.
- [Intervention experiments] Clarify whether the deterministic Hadamard rotations are applied only during Wgrad computation or throughout the pipeline, and report the exact overhead in terms of additional compute or memory.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying a potential gap in our experimental isolation. We address the major comment below and have revised the manuscript to include the requested ablation.
read point-by-point responses
-
Referee: [Progressive FP4 enabling experiments (as described in abstract and methods)] The central attribution that quantizing Wgrad is the primary driver of convergence degradation is based on cumulative progressive enabling (Fprop FP4, then +Dgrad FP4, then +Wgrad FP4). This design lacks an isolated ablation applying MXFP4 only to Wgrad while keeping Fprop and Dgrad in full precision. Without it, synergistic interactions between quantized activations from earlier stages and the Wgrad computation cannot be ruled out, so the claim that instability is driven specifically by 'structured micro-scaling errors along sensitive gradient paths' in Wgrad rests on an unverified isolation assumption.
Authors: We thank the referee for this observation. Our progressive enabling design was intended to isolate the incremental impact of each quantization stage under otherwise fixed conditions, and the data show that Fprop and Dgrad FP4 produce only modest token increases while the addition of Wgrad FP4 triggers the primary degradation. This incremental pattern supports our attribution to Wgrad. Nevertheless, we agree that an isolated Wgrad-only ablation is required to fully exclude synergistic interactions with prior-stage quantization. In the revised manuscript we have added this experiment (MXFP4 applied exclusively to Wgrad with Fprop and Dgrad in full precision) on the same Llama 3.1-8B / C4 setup; the new results confirm that Wgrad quantization alone reproduces the observed instability, reinforcing the interpretation of structured micro-scaling errors along gradient paths. revision: yes
Circularity Check
No circularity in experimental ablation results
full rationale
The paper's findings derive from direct empirical ablations on native MXFP4 hardware, progressively enabling FP4 in Fprop, Dgrad, and Wgrad while measuring convergence on Llama 3.1-8B pretraining. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the reported results. The attribution of degradation primarily to Wgrad follows from the controlled comparisons without reducing to the inputs by construction. No self-citation load-bearing steps or ansatz smuggling are present in the provided text.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116,
Optimizing large language model training using fp4 quantization , author=. arXiv preprint arXiv:2501.17116 , year=
work page internal anchor Pith review arXiv
-
[3]
Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023
Microscaling data formats for deep learning , author=. arXiv preprint arXiv:2310.10537 , year=
-
[4]
Advances in Neural Information Processing Systems , volume=
Outlier suppression: Pushing the limit of low-bit transformer language models , author=. Advances in Neural Information Processing Systems , volume=
-
[5]
AMD Instinct™ MI355X GPUs , author =
-
[6]
Fp4 all the way: Fully quantized training of llms.arXiv preprint arXiv:2505.19115, 2025
Fp4 all the way: Fully quantized training of llms , author=. arXiv preprint arXiv:2505.19115 , year=
-
[7]
Proceedings of the 2023 conference on empirical methods in natural language processing , pages=
Llm-fp4: 4-bit floating-point quantized transformers , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=
work page 2023
-
[8]
Towards efficient pre-training: Exploring fp4 precision in large language models , author=. arXiv preprint arXiv:2502.11458 , year=
-
[9]
Bridging the gap between promise and performance for microscaling FP4 quantization , author=. arXiv preprint arXiv:2509.23202 , year=
-
[10]
Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025
Pretraining large language models with nvfp4 , author=. arXiv preprint arXiv:2509.25149 , year=
-
[11]
Advances in Neural Information Processing Systems , volume=
Quartet: Native fp4 training can be optimal for large language models , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
arXiv preprint arXiv:2603.08747 , year=
Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 , author=. arXiv preprint arXiv:2603.08747 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.