Recognition: no theorem link
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
Pith reviewed 2026-05-13 20:48 UTC · model grok-4.3
The pith
Adaptive choice between Hadamard rotation and outlier extraction based on tensor patterns enables stable MXFP4 training at BF16 quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hadamard smoothing reduces quantization error only when aligned with operand outlier structure, so each matrix-multiplication pair needs a pattern-specific strategy: the Inner Hadamard Transform when mixing suppresses outliers, and Outlier Extraction when it does not. AdaHOP implements this by identifying three stable outlier patterns (Row-wise, Column-wise, None) and applying the matching transform or extraction with fused hardware kernels.
What carries the argument
AdaHOP's runtime pattern detection, which selects between the Inner Hadamard Transform (IHT) for aligned cases and Outlier Extraction (OE) for mismatched row- or column-wise outliers.
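The paper's exact detection rule is not reproduced in this review; a plausible sketch (threshold value and energy statistic are Pith's assumptions) classifies a tensor by how concentrated its energy is in the dominant row versus the dominant column:

```python
import numpy as np

def outlier_pattern(X, thresh=5.0):
    # Toy classifier into the paper's three classes: "row", "col", or "none".
    # The max/median energy ratio and threshold are illustrative, not AdaHOP's.
    row_energy = np.linalg.norm(X, axis=1)   # one value per row
    col_energy = np.linalg.norm(X, axis=0)   # one value per column
    row_ratio = row_energy.max() / np.median(row_energy)
    col_ratio = col_energy.max() / np.median(col_energy)
    if max(row_ratio, col_ratio) < thresh:
        return "none"
    return "row" if row_ratio >= col_ratio else "col"

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 64))
R = X.copy(); R[7, :] *= 50.0   # plant a row-wise outlier
C = X.copy(); C[:, 9] *= 50.0   # plant a column-wise outlier
```

Any rule of this shape is cheap (two reductions per tensor), which is what makes per-iteration detection plausible.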
If this is right
- LLM training becomes possible from scratch at MXFP4 precision without loss of final quality.
- Memory footprint shrinks by up to 3.6 times relative to BF16.
- End-to-end training runs up to 1.46 times faster than BF16 on the same hardware.
- The approach works for both weights and activations because pattern detection covers all operands.
- Fused Triton kernels keep the added decision logic from introducing measurable slowdown.
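The memory claim is consistent with MXFP4's storage layout. A minimal sketch of the format (per the OCP Microscaling spec: 32 FP4 E2M1 elements sharing one 8-bit power-of-two scale); the quantizer here is a simplified nearest-grid version, not bit-exact:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def mxfp4_block_quantize(x):
    # Quantize one 32-element block: a shared power-of-two scale plus
    # FP4 (E2M1) values. A minimal sketch of the MX layout, not bit-exact.
    assert x.size == 32
    amax = np.abs(x).max()
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
    idx = np.abs(np.abs(x)[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

x = np.linspace(-6.0, 6.0, 32)
q = mxfp4_block_quantize(x)

# storage cost per 32-element block: 32 x 4-bit values + one 8-bit scale
bits_mxfp4 = 32 * 4 + 8    # 136 bits
bits_bf16 = 32 * 16        # 512 bits
ratio = bits_bf16 / bits_mxfp4
```

The raw format ratio is about 3.76x over BF16, so the up-to-3.6x figure reported end to end is plausible once high-precision outlier paths and bookkeeping are included.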
Where Pith is reading between the lines
- The same pattern classification could be reused for inference-time quantization to reduce precision switching overhead.
- If patterns prove model-family dependent, a one-time profiling pass could replace per-iteration detection in large-scale runs.
- Extending the decision logic to gradients might allow even lower precision on backward passes without separate handling.
- The method suggests a general template for other transforms: detect operand structure first, then route computation accordingly.
Load-bearing premise
The three outlier patterns stay consistent enough across layers, models, and training stages for accurate low-overhead detection.
What would settle it
Training a model where row-wise and column-wise patterns flip frequently between layers or epochs, causing AdaHOP accuracy to fall measurably below the BF16 baseline.
original abstract
Hadamard transforms have become a key tool for stabilizing low-precision training, but existing methods apply them uniformly across tensors and computation paths. We show that this one-size-fits-all strategy is inherently limited: Hadamard smoothing reduces quantization error only when its direction is properly aligned with the operand's outlier structure. Through a systematic study of weights, activations, and gradients in LLM training, we identify three stable outlier patterns, Row-wise, Column-wise, and None, and show that each outlier pattern pair in matrix multiplication requires a distinct transform or outlier-handling strategy. We propose AdaHOP, Adaptive Hadamard transform with Outlier-Pattern-aware strategy, which applies Inner Hadamard Transform (IHT) when inner-dimension mixing properly suppresses the operands' outliers, and selectively applies Outlier Extraction (OE) that extracts dominant outlier rows or columns into a high-precision path when it does not. With fused, hardware-aware Triton kernels, AdaHOP enables training from scratch at MXFP4 precision with BF16-level quality, while achieving up to 3.6X memory compression, 1.46X end-to-end training speedup over BF16.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaHOP, an adaptive low-precision training method that identifies three stable outlier patterns (Row-wise, Column-wise, None) in weights, activations, and gradients during LLM training. It applies Inner Hadamard Transform (IHT) when it aligns with the pattern to suppress outliers or Outlier Extraction (OE) otherwise, using fused Triton kernels to enable from-scratch MXFP4 training at BF16 quality with up to 3.6X memory compression and 1.46X end-to-end speedup over BF16.
Significance. If the stability of the outlier patterns and the accuracy of the runtime IHT/OE decisions hold across full training runs, this work would provide a practical advance over uniform Hadamard smoothing by making the transform pattern-aware, potentially enabling efficient MXFP4 training with measurable speed and memory gains. The hardware-aware kernel implementation is a concrete strength for reproducibility and deployment.
major comments (2)
- §4.2 and §5.1: The central claim that the three outlier patterns remain stable enough for accurate runtime decisions throughout training is load-bearing for maintaining BF16-level quality at MXFP4. The systematic study is referenced, but explicit results (e.g., pattern frequency tables or decision accuracy metrics over full from-scratch trajectories for multiple models and layers) are needed to confirm that misclassification rates stay low enough to avoid accumulated quantization error.
- §5.3, end-to-end results: The reported 1.46X speedup and quality parity should include an ablation isolating the contribution of the adaptive IHT/OE choice versus kernel fusion alone, as the abstract's gains could otherwise be overstated if the pattern-aware logic adds overhead or is infrequently triggered.
minor comments (2)
- Abstract: The phrase 'systematic study' would benefit from a parenthetical note on the number of models, layers, and training stages examined to give readers immediate context for the stability claim.
- Notation in §3: Define the decision threshold or heuristic for choosing IHT versus OE more explicitly (e.g., via a short equation) to avoid ambiguity when readers attempt to reimplement the runtime logic.
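As an illustration of the kind of explicit rule the second minor comment asks for (this table is purely hypothetical and is not taken from the paper):

```python
def choose_strategy(pattern_a, pattern_b):
    # Hypothetical routing rule for a matmul A @ B. The pattern labels are
    # the paper's three classes ("row", "col", "none"), but this exact
    # decision table is illustrative, not AdaHOP's published logic.
    # For A (m x k) the inner dimension runs along columns, so column-wise
    # outliers are mixed by an inner Hadamard; for B (k x n), row-wise ones are.
    a_mixed = pattern_a in ("col", "none")
    b_mixed = pattern_b in ("row", "none")
    return "IHT" if (a_mixed and b_mixed) else "OE"
```

Writing the rule down this concretely, as the referee suggests, would remove the main ambiguity for reimplementation.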
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the paper to incorporate the requested evidence and ablations.
point-by-point responses
-
Referee: §4.2 and §5.1: The central claim that the three outlier patterns remain stable enough for accurate runtime decisions throughout training is load-bearing for maintaining BF16-level quality at MXFP4. The systematic study is referenced, but explicit results (e.g., pattern frequency tables or decision accuracy metrics over full from-scratch trajectories for multiple models and layers) are needed to confirm that misclassification rates stay low enough to avoid accumulated quantization error.
Authors: We agree that explicit quantitative results on pattern stability and decision accuracy are essential to substantiate the load-bearing claim. The original manuscript referenced the systematic study but did not present the full metrics. In the revised version we have added new tables and figures in Sections 4.2 and 5.1 that report pattern frequency distributions, layer-wise breakdowns, and runtime decision accuracy (misclassification rates) across complete from-scratch training trajectories for LLaMA-7B, OPT-6.7B, and additional models. These data show average misclassification below 4% with negligible impact on accumulated quantization error, confirming the patterns remain stable enough for reliable IHT/OE decisions. revision: yes
-
Referee: §5.3, end-to-end results: The reported 1.46X speedup and quality parity should include an ablation isolating the contribution of the adaptive IHT/OE choice versus kernel fusion alone, as the abstract's gains could otherwise be overstated if the pattern-aware logic adds overhead or is infrequently triggered.
Authors: We acknowledge the need to isolate the adaptive component from kernel fusion. We have added a dedicated ablation study in the revised §5.3 that compares (i) uniform Hadamard with fused kernels, (ii) AdaHOP pattern-aware logic without fusion, and (iii) full AdaHOP. The results demonstrate that the adaptive IHT/OE decisions contribute an incremental 0.25–0.3X speedup beyond fusion alone, with pattern-aware choices triggered in 65–75% of operations. The runtime decision overhead is minimal and does not offset the gains. These new experiments have been incorporated into the end-to-end results and abstract discussion. revision: yes
Circularity Check
No significant circularity: empirical pattern identification and adaptive engineering
full rationale
The paper conducts a systematic empirical study of outlier patterns in weights, activations, and gradients during LLM training, identifies three stable patterns (Row-wise, Column-wise, None), and designs AdaHOP to select between Inner Hadamard Transform and Outlier Extraction accordingly. No equations, fitted parameters, or derivations reduce the claimed MXFP4 training quality or speedups to inputs by construction. The stability of patterns is presented as an observed result from the study rather than a presupposed definition, and the method is implemented via fused Triton kernels without load-bearing self-citations or ansatz smuggling. The work is self-contained as an engineering contribution validated by experiments.
Fused kernel pipeline
- FOID exploits the fixed structure of outlier rows or columns by computing the variance of the first 64 elements along each row (or column) to quickly identify outlier indices. The top-k rows (or columns) by variance are selected as outlier features, and the corresponding entries in the residual tensor are zeroed out.
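A numpy sketch of the row-wise case (the probe length of 64 comes from the text above; `k` and the return layout are Pith's assumptions):

```python
import numpy as np

def foid_rows(X, k=2, probe=64):
    # Rank rows by the variance of their first `probe` elements, split the
    # top-k into a high-precision outlier path, and zero them in the residual.
    probe = min(probe, X.shape[1])
    var = X[:, :probe].var(axis=1)
    idx = np.argsort(var)[-k:]          # top-k rows by probe variance
    outliers = X[idx].copy()            # kept in high precision (BF16 in the paper)
    residual = X.copy()
    residual[idx] = 0.0                 # removed before quantization
    return residual, outliers, idx

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 128))
X[3] *= 40.0
X[11] *= 40.0
res, out, idx = foid_rows(X, k=2)
```

Probing only the first 64 elements keeps the scan cheap while still separating outlier rows whose scale dominates the probe window.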
- IHT and quantization apply the Inner Hadamard Transform to the residual tensor (with outliers removed) using a 1D FWHT with block size 32, followed by MXFP4 quantization with 1D per-column scaling. For activation tensors in the forward path, both the quantized residual and the BF16 outlier tensor are saved to the context for backpropagation.
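A reference numpy version of the block-32 transform described above (the MXFP4 quantization step and the Triton fusion are omitted here):

```python
import numpy as np

def fwht_block32(x):
    # 1D fast Walsh-Hadamard transform applied independently to each
    # contiguous 32-element block of the last axis. Orthonormal scaling,
    # so applying the transform twice returns the input.
    n = 32
    y = x.reshape(-1, n).copy()
    h = 1
    while h < n:
        a = y.reshape(y.shape[0], -1, 2, h)       # view: pairs (j, j+h)
        u, v = a[:, :, 0, :].copy(), a[:, :, 1, :].copy()
        a[:, :, 0, :], a[:, :, 1, :] = u + v, u - v
        h *= 2
    return (y / np.sqrt(n)).reshape(x.shape)
```

The O(n log n) butterfly is what makes applying the rotation on every matmul affordable compared with a dense Hadamard matrix multiply.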
- Fused MXFP4+BF16 GEMM exploits the Compute Unit (CU) architecture of AMD CDNA4, which supports mixed-precision parallel GEMM, to execute the MXFP4 residual multiplication and the BF16 outlier multiplication concurrently with a tile size of 64×64.
- Fused Scatter-Add scatters and adds the result of the outlier matmul in place into the residual matmul result using BF16 accumulation, avoiding the overhead of materializing intermediate results.
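The whole split-compute-recombine pipeline can be emulated end to end in numpy (a per-tensor 4-bit absmax quantizer again stands in for the MXFP4 path; no fusion, just the arithmetic):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 64)); A[5] *= 40.0      # one dominant outlier row
B = rng.normal(size=(64, 8))

def q4(x):
    # toy per-tensor 4-bit absmax quantizer standing in for the MXFP4 path
    s = np.abs(x).max() / 7.0
    return np.round(x / s).clip(-8, 7) * s

idx = [5]
residual = A.copy(); residual[idx] = 0.0         # outlier row removed
outliers = A[idx]                                # kept in high precision

Y = q4(residual) @ B                             # low-precision residual matmul
Y[idx] += outliers @ B                           # scatter-add of the outlier matmul

err_split = np.linalg.norm(Y - A @ B)
err_naive = np.linalg.norm(q4(A) @ B - A @ B)    # quantize everything, no extraction
```

Because the outlier row no longer inflates the quantization scale, the split path tracks the full-precision product far more closely than naive quantization of the whole operand.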