pith. machine review for the scientific record.

arxiv: 2604.17789 · v2 · submitted 2026-04-20 · 💻 cs.CV · cs.AI · cs.CL


DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization


Pith reviewed 2026-05-10 05:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords MXFP4 quantization · LLM inference · outlier rotation · 4-bit quantization · LLaMA-3 · microscaling · activation outliers · W4A4

The pith

A single outlier-aware rotation suffices for accurate MXFP4 quantization of LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes adapting the outlier-aware fine-grained rotation from earlier work to the MXFP4 microscaling format by setting the rotation block size exactly to the 32-element groups that share a scale factor. This alignment makes the dual-rotation and zigzag steps of the prior method unnecessary because each group carries its own independent scale, so cross-block variance does not distort the shared factor. The resulting single rotation targets outlier-heavy channels directly, cuts the online rotation cost in half, and produces smoother weight distributions. Experiments on LLaMA-3 models under MXFP4 weight-and-activation 4-bit quantization show consistent state-of-the-art accuracy. If the claim holds, the approach reduces both error and compute for hardware-native low-precision inference on large models.
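
The failure mode the rotation targets can be sketched numerically. Below is a simplified, hypothetical model of MXFP4-style group quantization (our illustration, not the paper's code): each 32-element group shares one power-of-two scale, and elements land on the signed E2M1 grid, whose largest magnitude is 6.0. A single outlier inflates the shared scale and degrades the other 31 elements.

```python
import numpy as np

# Simplified sketch of MXFP4-style group quantization (our illustration,
# not the paper's code): one shared power-of-two (E8M0-style) scale per
# 32-element group, elements snapped to the signed E2M1 grid.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(x):
    """Quantize one 32-element group with a shared power-of-two scale."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return x.copy()
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))  # fit max into grid
    scaled = x / scale
    nearest = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[nearest] * scale

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, 32)
spiked = clean.copy()
spiked[0] = 50.0  # one activation outlier in the group

# Measure error on the *other* 31 elements: the outlier inflates the
# shared scale and coarsens the grid they see.
err_clean = np.abs(quantize_group(clean) - clean)[1:].mean()
err_spiked = np.abs(quantize_group(spiked) - spiked)[1:].mean()
```

In this toy run the mean error on the non-outlier elements grows several-fold once the scale is forced up to cover the spike, which is the dynamic-range compression the review describes.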

Core claim

DuQuant++ demonstrates that aligning the block size of an outlier-aware rotation to the MXFP4 microscaling group size of 32 allows the full dual-rotation pipeline to be replaced by one rotation step. The single rotation suppresses activation outliers that would otherwise inflate a block's shared E8M0 scale factor, thereby preserving dynamic range for the remaining elements, while also smoothing the weight distribution and halving online rotation overhead during MXFP4 W4A4 quantization of LLMs such as the LLaMA-3 family.

What carries the argument

outlier-aware fine-grained rotation with rotation block size set to B=32 to match MXFP4 microscaling groups, which applies a targeted transform to outlier-concentrated channels within each independently scaled block
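
As a rough sketch of what a block-aligned rotation buys: any orthogonal transform applied within a 32-element block spreads an outlier's mass across that block while remaining exactly invertible. The snippet below uses a random orthogonal matrix as a hypothetical stand-in; DuQuant's actual rotation is constructed to target the identified outlier channels, which this sketch does not attempt.

```python
import numpy as np

B = 32  # rotation block size, matched to the MXFP4 group size
rng = np.random.default_rng(1)

# Hypothetical stand-in rotation: a random orthogonal B x B matrix.
# The paper's rotation is outlier-aware; any orthogonal Q illustrates
# the mechanism of spreading one large channel across its own block.
Q, _ = np.linalg.qr(rng.normal(size=(B, B)))

x = rng.normal(0.0, 1.0, 4 * B)
x[3] = 40.0                  # a single outlier channel
blocks = x.reshape(-1, B)

rotated = blocks @ Q         # applied independently within each block
restored = rotated @ Q.T     # Q is orthogonal, so the transform is exact

peak_before = np.abs(x).max()
peak_after = np.abs(rotated).max()
```

The rotation never mixes values across blocks, so it cannot re-inflate another group's shared scale, and in this run the peak magnitude after rotation falls below the raw outlier peak.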

Load-bearing premise

Independent scaling factors per 32-element MXFP4 group remove the cross-block variance that previously required dual rotations and zigzag permutations.
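
The premise is easy to check in a toy setting (our construction, not the paper's): with one power-of-two scale per 32-element group, an outlier confined to group 0 leaves group 1's quantization error untouched, whereas a tensor-wide shared scale degrades it.

```python
import numpy as np

def q4(x, scale):
    # Crude signed 4-bit uniform quantizer under a given scale (illustrative).
    return np.clip(np.round(x / scale), -7, 7) * scale

def pow2_scale(x):
    # Power-of-two scale fitting the max magnitude into the 4-bit range.
    return 2.0 ** np.ceil(np.log2(np.abs(x).max() / 7.0))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 64)
x[0] = 100.0            # outlier confined to group 0
g1 = x[32:]             # outlier-free group

err_per_group = np.abs(q4(g1, pow2_scale(g1)) - g1).mean()  # own scale
err_shared = np.abs(q4(g1, pow2_scale(x)) - g1).mean()      # scale set by group 0
```

Under per-group scales the outlier's damage is contained to its own 32 elements, which is why a single in-group rotation can address it without cross-block machinery.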

What would settle it

An experiment on an LLM in which the single-rotation version produces measurably higher quantization error or lower task accuracy than the original dual-rotation pipeline under identical MXFP4 W4A4 settings.

Figures

Figures reproduced from arXiv: 2604.17789 by Bingchen Yao, Haobo Xu, Haokun Lin, Qingfu Zhang, Xianglong Guo, Xinle Jia, Yichen Wu, Ying Wei, Zhenan Sun, Zhichao Lu.

Figure 1. MXFP4 quantization error across all 32 layers of LLaMA-3-8B at three representative …
Original abstract

The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B{=}32). Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at https://github.com/Hsu1023/DuQuant-v2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DuQuant++, an adaptation of the DuQuant outlier-aware rotation method to the MXFP4 microscaling FP4 format for LLM quantization. By setting the rotation block size B=32 to match the MXFP4 group size and exploiting independent per-group E8M0 scaling factors, the authors replace the original DuQuant's dual rotations plus zigzag permutation with a single outlier-aware rotation. This is claimed to halve online rotation cost while smoothing weight distributions, with extensive experiments on LLaMA-3 models under W4A4 quantization demonstrating state-of-the-art performance.

Significance. If the central empirical claims hold, the work provides a practical, lower-overhead rotation strategy for microscaling quantization that directly targets activation outliers, which is relevant for efficient inference on hardware with native MXFP4 support such as NVIDIA Blackwell Tensor Cores.

major comments (2)
  1. [Method / justification for single rotation] The central engineering claim—that independent per-group scaling in MXFP4 fully eliminates the cross-block variance issue, rendering dual rotations and zigzag permutation unnecessary—is asserted without supporting analysis or ablation. No demonstration is given that outlier-induced variance does not propagate across groups or that single-rotation performance matches the dual-rotation baseline under MXFP4 (this directly supports the cost-halving and SOTA claims).
  2. [Experiments] The experiments section reports SOTA results on LLaMA-3 under MXFP4 W4A4 but lacks explicit ablation tables isolating the effect of replacing dual rotations with the single outlier-aware rotation; without these, it is difficult to attribute gains specifically to the proposed simplification rather than other factors.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., perplexity or accuracy delta versus the strongest baseline) to ground the SOTA claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment point by point below. We agree that additional analysis and ablations will strengthen the paper and will incorporate them in the revision.

Point-by-point responses
  1. Referee: [Method / justification for single rotation] The central engineering claim—that independent per-group scaling in MXFP4 fully eliminates the cross-block variance issue, rendering dual rotations and zigzag permutation unnecessary—is asserted without supporting analysis or ablation. No demonstration is given that outlier-induced variance does not propagate across groups or that single-rotation performance matches the dual-rotation baseline under MXFP4 (this directly supports the cost-halving and SOTA claims).

    Authors: We appreciate this observation. The justification is based on the MXFP4 format property that each 32-element group has an independent E8M0 scaling factor, unlike standard quantization where a shared scale allows outlier variance to propagate across blocks. This independence isolates the effect of outliers to their own group, so a single outlier-aware rotation (with B=32) suffices to smooth distributions within each group without needing dual rotations or zigzag permutation. While this is stated in the method section, we agree more explicit support is needed. In the revision we will add a short theoretical paragraph explaining the lack of cross-group propagation and an ablation comparing single-rotation DuQuant++ against an adapted dual-rotation baseline under MXFP4 to directly support the cost and performance claims. revision: yes

  2. Referee: [Experiments] The experiments section reports SOTA results on LLaMA-3 under MXFP4 W4A4 but lacks explicit ablation tables isolating the effect of replacing dual rotations with the single outlier-aware rotation; without these, it is difficult to attribute gains specifically to the proposed simplification rather than other factors.

    Authors: We agree that the current experiments emphasize end-to-end comparisons rather than isolating the single-rotation simplification. We will add dedicated ablation tables in the revised manuscript that directly compare (i) DuQuant++ (single rotation), (ii) an MXFP4-adapted dual-rotation variant, and (iii) the original DuQuant pipeline. These tables will quantify the performance difference and overhead reduction attributable to the simplification enabled by per-group scaling. revision: yes

Circularity Check

0 steps flagged

Empirical adaptation of prior method with no self-referential derivation

Full rationale

The paper presents DuQuant++ as an engineering adaptation of the outlier-aware fine-grained rotation from DuQuant, aligned to MXFP4 by setting rotation block size B=32 to match the microscaling group size. The key simplification (replacing dual rotations and zigzag permutation with a single rotation) is asserted because independent per-group E8M0 scales make cross-block variance irrelevant. No mathematical equations, derivations, or predictions are provided that reduce by construction to fitted inputs, self-definitions, or prior self-citations. Validity is supported by external experiments on LLaMA-3 under W4A4, not internal consistency. The citation to DuQuant is present but not load-bearing for any uniqueness theorem or ansatz; the central claim remains an empirical observation rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that outlier channels identified by the prior DuQuant method remain effective when the rotation block is forced to exactly 32 elements; no free parameters, new entities, or additional axioms are introduced in the abstract.

axioms (1)
  • domain assumption Activation outliers concentrate in specific channels that can be targeted by a data-dependent rotation.
    Inherited from the cited DuQuant work and required for the single-rotation simplification to be beneficial.
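
The inherited assumption is that outliers are structured by channel rather than scattered at random, so column statistics from calibration data can point the rotation at them. A toy illustration (ours, with hypothetical channel indices):

```python
import numpy as np

# Toy illustration (ours) of channel-structured outliers: a few channels
# carry systematically large magnitudes across all tokens, so a
# data-dependent rotation can be aimed at them via per-channel statistics.
rng = np.random.default_rng(3)
tokens, channels = 256, 64
acts = rng.normal(0.0, 1.0, (tokens, channels))
acts[:, [5, 40]] *= 30.0            # two persistent outlier channels

col_max = np.abs(acts).max(axis=0)  # per-channel calibration statistic
outlier_channels = set(np.argsort(col_max)[-2:].tolist())
```

If outliers instead appeared in different channels for every token, no fixed data-dependent rotation could target them, and the single-rotation simplification would lose its advantage over data-agnostic Hadamard transforms.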

pith-pipeline@v0.9.0 · 5578 in / 1448 out tokens · 60681 ms · 2026-05-10T05:55:39.894240+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 21 canonical work pages · 6 internal anchors

  1. [1]

    QuaRot: Outlier-Free 4-bit Inference in Rotated LLMs

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. arXiv preprint arXiv:2404.00456.

  2. [2]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  3. [3]

    Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

    Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010.

  4. [4]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  5. [5]

    Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

    Vage Egiazarian, Roberto L Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, et al. Bridging the gap between promise and performance for microscaling FP4 quantization. arXiv preprint arXiv:2509.23202.

  6. [6]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

  7. [7]

    Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis

    Hong Huang and Dapeng Wu. Quaff: Quantized parameter-efficient fine-tuning under outlier spatial stability hypothesis. arXiv preprint arXiv:2505.14742.

  8. [8]

    Tequila: Trapping-free ternary quantization for large language models

    Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, and Dapeng Wu. Tequila: Trapping-free ternary quantization for large language models. arXiv preprint arXiv:2509.23809,

  9. [9]

    Sherry: Hardware-efficient 1.25-bit ternary quantization via fine-grained sparsification

    Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, and Dapeng Wu. Sherry: Hardware-efficient 1.25-bit ternary quantization via fine-grained sparsification. arXiv preprint arXiv:2601.07892,

  10. [10]

    DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. DuQuant: Distributing outliers via dual transformation makes stronger quantized LLMs. Advances in Neural Information Processing Systems, 37:87766–87800, 2024.

  11. [11]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978.

  12. [12]

    Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint arXiv:2405.04532, 2024.

  13. [13]

    MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

    Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, and Xindian Ma. MicroMix: Efficient mixed-precision quantization with microscaling formats for large language models. arXiv preprint arXiv:2508.02343.

  14. [14]

    AffineQuant: Affine Transformation Quantization for Large Language Models

    Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. AffineQuant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024.

  15. [15]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391,

  16. [16]

    Block rotation is all you need for mxfp4 quantization

    Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, and Jian Cheng. Block rotation is all you need for MXFP4 quantization. arXiv preprint arXiv:2511.04214.

  17. [17]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.

  18. [18]

    QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024.

  19. [19]

    Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling

    Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665,

  20. [20]

    Automated fine-grained mixture-of-experts quantization

    Zhanhao Xie, Yuexiao Ma, Xiawu Zheng, Fei Chao, Wanchen Sui, Yong Li, Shen Li, and Rongrong Ji. Automated fine-grained mixture-of-experts quantization. InFindings of the Association for Computational Linguistics: ACL 2025, pages 27024–27037,

  21. [21]

    Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

    Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, and Hanghang Tong. Prune as you generate: Online rollout pruning for faster and better RLVR. arXiv preprint arXiv:2603.24840, 2026.

  22. [22]

    DoPQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers

    Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. DoPQ-ViT: Towards distribution-friendly and outlier-aware post-training quantization for vision transformers. arXiv preprint arXiv:2408.03291, 2024.

  23. [23]

    LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation

    Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for text-to-image generation. arXiv preprint arXiv:2508.03485.

  24. [24]

    QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

    Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, and Mi Zhang. QuantVLA: Scale-calibrated post-training quantization for vision-language-action models. arXiv preprint arXiv:2602.20309.