HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models
Pith reviewed 2026-06-26 08:44 UTC · model grok-4.3
The pith
HyperQuant achieves superior rate-distortion tradeoffs for LLM and diffusion model quantization by applying a per-tile randomized Hadamard transform before lattice quantization and Rice coding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HyperQuant demonstrates that the combination of a per-tile randomized Hadamard transform, quantization onto low-dimensional lattices (E8, D4, A2, Z), variable-length Rice coding, and KV bias correction produces lower distortion at given rates than existing methods, yielding 3.9x compression on linear weights and 3.79x on KV cache at 4 bits per scalar with near-lossless end-to-end quality on an H100.
What carries the argument
The per-tile randomized Hadamard transform that reshapes weight and activation distributions to be approximately Gaussian, enabling near-optimal lattice quantization followed by Rice coding.
If this is right
- The method outperforms HIGGS at every operating point between 3 and 5 bits per scalar on weights.
- It exceeds TurboQuant and OCTOPUS performance on KV-cache quantization down to 1.7 bits per scalar.
- End-to-end compression reaches approximately 3.9 times for linear weights and 3.79 times for KV cache at 4 bits per scalar while remaining near lossless on H100 hardware.
- The same pipeline quantizes a 19B-parameter diffusion transformer video model with no observable per-frame artifacts.
Where Pith is reading between the lines
- The memory savings could allow larger batch sizes or longer context lengths on the same GPU without retraining.
- Because the pipeline integrates directly with existing 8-bit and 4-bit Tensor-Core paths, it may accelerate inference on current accelerator hardware without new silicon.
- The Gaussianizing property of the transform might be reusable as a preprocessing step for other quantization or compression techniques that assume Gaussian inputs.
Load-bearing premise
The randomized Hadamard transform produces coordinate distributions close enough to Gaussian that the selected low-dimensional lattices achieve near-minimal quantization error, and the KV bias correction preserves the semantics of attention inner products.
What would settle it
Measure the actual mean-squared error after applying the randomized Hadamard transform and lattice quantization to a real transformer weight tile and compare it against the known minimum distortion of the lattice at that rate; a large gap would falsify the near-optimality claim.
Figures
read the original abstract
We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. Across a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. We further integrate the pipeline with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), and find that int8 beats fp8 on the post-RHT lattice output. Project page: https://moonmath.ai/hyperquant/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HyperQuant, a post-training quantization pipeline for LLM and diffusion model weights and KV caches. It combines a per-tile Randomized Hadamard Transform (RHT) to produce approximately Gaussian marginals, quantization onto low-dimensional lattices (E8, D4, A2, Z), lossless Rice coding of lattice indices, and KV bias correction to preserve inner-product semantics. Experiments (Table 1) claim consistent outperformance versus HIGGS (weights, 3–5 bps), TurboQuant and OCTOPUS (KV, down to 1.7 bps), plus artifact-free quantization of the 19B LTX-2 DiT model and ~3.9× / 3.79× compression on H100 at 4 bps with near-lossless quality. The pipeline is also integrated with 8-bit/4-bit Tensor-Core paths.
Significance. If the reported gains are reproducible and the lattice advantage is shown to stem from the RHT-induced near-Gaussianity rather than from unstated implementation details, the work would supply a concrete rate-distortion construction that unifies weight and KV quantization while remaining hardware-aware. The explicit integration with fp8/int8/nvfp4/mxfp4 MMA paths and the self-contained experimental suite are practical strengths. The absence of quantitative distribution diagnostics, however, leaves open whether the headline superiority can be credited to the claimed optimality of the lattices.
major comments (2)
- [Abstract / method section] Abstract and method description: the central performance claims rest on the assertion that per-tile RHT yields “approximately Gaussian” per-coordinate distributions for which E8/D4/A2/Z lattices are near-optimal. No kurtosis, skewness, Kolmogorov–Smirnov statistic, or excess-kurtosis values are supplied for the actual weight or KV tensors, nor is any tile-wise variation quantified. Because the rate-distortion optimality argument is load-bearing for attributing gains over HIGGS/TurboQuant/OCTOPUS to the lattice step rather than to other pipeline components, this verification is required.
- [Table 1] Table 1 and experimental protocol: the abstract states that HyperQuant “outperforms … at every operating point,” yet the provided text supplies neither error bars, number of random seeds, exclusion criteria for tiles, nor the precise definition of “near-lossless quality.” These details are necessary to assess whether the reported 3–5 bps and 1.7 bps operating points are robust.
minor comments (1)
- [Abstract] The abstract mentions integration with “fp8-e4m3, int8, nvfp4, mxfp4” but does not state which of these paths was used for the headline H100 numbers; a single clarifying sentence would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify areas where additional quantitative support would strengthen the attribution of gains to the lattice step and improve experimental transparency. We respond point-by-point below.
read point-by-point responses
-
Referee: [Abstract / method section] Abstract and method description: the central performance claims rest on the assertion that per-tile RHT yields “approximately Gaussian” per-coordinate distributions for which E8/D4/A2/Z lattices are near-optimal. No kurtosis, skewness, Kolmogorov–Smirnov statistic, or excess-kurtosis values are supplied for the actual weight or KV tensors, nor is any tile-wise variation quantified. Because the rate-distortion optimality argument is load-bearing for attributing gains over HIGGS/TurboQuant/OCTOPUS to the lattice step rather than to other pipeline components, this verification is required.
Authors: We agree that the manuscript would benefit from explicit distribution diagnostics to support the near-Gaussianity claim and to strengthen the link between the RHT and lattice optimality. While the theoretical motivation for RHT is standard in the literature, we did not report kurtosis, skewness, or KS statistics. In revision we will add these metrics (kurtosis, excess kurtosis, skewness, and KS p-values) computed on representative tiles from both weight and KV tensors, plus a brief quantification of tile-to-tile variation. This will allow direct assessment of how closely the post-RHT marginals match the Gaussian assumption underlying the lattice choice. revision: yes
-
Referee: [Table 1] Table 1 and experimental protocol: the abstract states that HyperQuant “outperforms … at every operating point,” yet the provided text supplies neither error bars, number of random seeds, exclusion criteria for tiles, nor the precise definition of “near-lossless quality.” These details are necessary to assess whether the reported 3–5 bps and 1.7 bps operating points are robust.
Authors: The post-training quantization pipeline is deterministic once the per-tile RHT seeds are fixed; we used a single fixed seed per model for full reproducibility, so classical error bars over random seeds do not apply. We will add an explicit statement of this fact, report the seed value, state that no tiles were excluded, and define “near-lossless” quantitatively (perplexity within 0.5 % of the unquantized baseline for LLMs; no visible per-frame artifacts plus PSNR > 38 dB for the DiT model). These clarifications will be inserted into the experimental protocol section and Table 1 caption. revision: yes
Circularity Check
No circularity: constructive pipeline validated on external baselines
full rationale
The paper presents HyperQuant as a pipeline that combines four known techniques (per-tile RHT, low-dimensional lattices, Rice coding, KV bias correction) and reports empirical gains against external baselines (HIGGS, TurboQuant, OCTOPUS) in self-contained experiments. No equations, parameters, or self-citations reduce any reported performance metric or optimality claim to a quantity fitted or defined inside the same paper. The Gaussianity assumption is an empirical premise, not a self-referential derivation step.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Per-tile randomized Hadamard transform produces approximately Gaussian coordinate distributions for weights and activations.
- domain assumption Bias-correction methods keep reconstruction unbiased under inner products and thereby preserve attention semantics.
Reference graph
Works this paper leans on
-
[1]
The Llama-3 herd of models
Meta AI. The Llama-3 herd of models. Meta AI research publication, 2024. URLhttps: //ai.meta.com/research/publications/the-llama-3-herd-of-models/
2024
-
[2]
The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009
Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009
2009
-
[3]
Slicegpt: Compress large language models by deleting rows and columns
Saleh Ashkboos, Maximilian L. Croci, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InICLR, 2024. URLhttps: //arxiv.org/abs/2401.15024
-
[4]
Mark Boss, Vikram Voleti, Simon Donné, and Shimon Vainer. Octopus: Optimized kv cache for transformers via octahedral parametrization under optimal squared error quantization, 2026. URLhttps://arxiv.org/abs/2605.21226
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[5]
QuIP: 2-bit quantization of large language models with guarantees
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. InNeurIPS, 2023. URLhttps://arxiv.org/abs/ 2307.13304
-
[6]
Conway and Neil J
John H. Conway and Neil J. A. Sloane.Sphere Packings, Lattices and Groups. Springer, 3rd edition, 1999
1999
-
[7]
Cover and Joy A
Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley, 2nd edition, 2006
2006
-
[8]
A sparse Johnson–Lindenstrauss transform
Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson–Lindenstrauss transform. InSTOC, 2010
2010
-
[9]
SpQR: A sparse-quantized representation for near-lossless LLM weight compression
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. InICLR, 2024. URLhttps://arxiv. org/abs/2306.03078
-
[10]
Jarek Duda. Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding.arXiv preprint arXiv:1311.2540, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[11]
On the closeness of the random-dither mapping to the information- theoretic optimum for vector quantization.IEEE Transactions on Information Theory, 51(10): 3617–3631, 2005
Uri Erez and Ram Zamir. On the closeness of the random-dither mapping to the information- theoretic optimum for vector quantization.IEEE Transactions on Information Theory, 51(10): 3617–3631, 2005
2005
-
[12]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024. URLhttps://arxiv.org/abs/2403.03206
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
David Forney and Lee-Fang Wei
G. David Forney and Lee-Fang Wei. Multidimensional constellations—Part I: Introduction, figures of merit, and generalized cross constellations.IEEE Journal on Selected Areas in Communications, 7(6):877–892, 1989
1989
-
[14]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post- training quantization for generative pre-trained transformers. InICLR, 2023. URLhttps: //arxiv.org/abs/2210.17323. 29
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Herbert Gish and John N. Pierce. Asymptotically efficient quantizing.IEEE Transactions on Information Theory, 14(5):676–683, 1968
1968
-
[16]
Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.SIAM Review, 53(2):217–288, 2011. URLhttps://arxiv.org/abs/0909.4061
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[17]
LTX-Video: A real-time video generation model
Lightricks. LTX-Video: A real-time video generation model. GitHub repository, 2024. URL https://github.com/Lightricks/LTX-Video
2024
-
[18]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. InMLSys, 2024. URLhttps://arxiv.org/ abs/2306.00978
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Lookabaugh and Robert M
Thomas D. Lookabaugh and Robert M. Gray. High-resolution quantization theory and the vector quantizer advantage.IEEE Transactions on Information Theory, 35(5):1020–1033, 1989
1989
-
[20]
Pushing the limits of large language model quantization via the linearity theorem
Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Pushing the limits of large language model quantization via the linearity theorem. InNAACL,
- [21]
-
[22]
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. InICLR, 2017. URLhttps://arxiv.org/abs/1609.07843
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[24]
URLhttps://arxiv.org/abs/2209.05433
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
NVIDIA hopper h100 architecture white paper
NVIDIA. NVIDIA hopper h100 architecture white paper. NVIDIA white paper, 2022. URL https://resources.nvidia.com/en-us-tensor-core
2022
-
[26]
NVIDIA blackwell architecture technical brief
NVIDIA. NVIDIA blackwell architecture technical brief. NVIDIA technical brief, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture
2024
-
[27]
OCP Microscaling Formats (MX) specification v1.0
Open Compute Project. OCP Microscaling Formats (MX) specification v1.0. OCP specification, 2023. URL https://www.opencompute.org/documents/ ocp-microscaling-formats-mx-v1-0-spec-final-pdf
2023
-
[28]
Optimal quadratic quantization for numerics: The Gaussian case.Monte Carlo Methods and Applications, 9(2):135–165, 2003
Gilles Pagès and Jacques Printems. Optimal quadratic quantization for numerics: The Gaussian case.Monte Carlo Methods and Applications, 9(2):135–165, 2003
2003
-
[29]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. URLhttps://arxiv.org/abs/2212.09748
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Efficiently scaling transformer inference
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xu, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In MLSys, 2023. URLhttps://arxiv.org/abs/2211.05102
-
[31]
Robert F. Rice. Some practical universal noiseless coding techniques.JPL Publication 79-22, 1979
1979
-
[32]
Microscaling data formats for deep learning, 2023 b
Bita Darvish Rouhani, Nitin Garegrat, Tom Madian, Jeremy Lo, Brian Cook, Daniel Pinto, et al. Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023. URLhttps://arxiv.org/abs/2310.10537. 30
-
[33]
NestQuant: Nested lattice quantization for matrix products and LLMs
Semyon Savkin, Eitan Porat Chen, Or Lou, and Yury Polyanskiy. NestQuant: Nested lattice quantization for matrix products and LLMs. InICML, 2025. URLhttps://arxiv.org/abs/ 2502.09720
-
[34]
Dither signals and their effect on quantization noise.IEEE Transactions on Communication Technology, 12(4):162–165, 1964
Leonard Schuchman. Dither signals and their effect on quantization noise.IEEE Transactions on Communication Technology, 12(4):162–165, 1964
1964
-
[35]
Omniquant: Omnidirectionally calibrated quantization for large language models, 2024
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantiza- tion for large language models. InICLR, 2024. URLhttps://arxiv.org/abs/2308.13137
-
[36]
QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. InICML,
- [37]
-
[38]
Massively parallel Huffman decoding on GPUs
André Weißenberger and Bertil Schmidt. Massively parallel Huffman decoding on GPUs. In Proceedings of the 47th International Conference on Parallel Processing (ICPP), 2018
2018
-
[39]
Sullivan, Gisle Bjøntegaard, and Ajay Luthra
Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard.IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003
2003
-
[40]
Smoothquant: Accurate and efficient post-training quantization for large language models, 2024
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023. URLhttps://arxiv.org/abs/2211.10438
-
[41]
Cambridge University Press, 2014
Ram Zamir.Lattice Coding for Signals and Networks. Cambridge University Press, 2014
2014
-
[42]
On universal quantization by randomized uniform/lattice quantizers
Ram Zamir and Meir Feder. On universal quantization by randomized uniform/lattice quantizers. IEEE Transactions on Information Theory, 38(2):428–436, 1992
1992
-
[43]
Amir Zandieh, Majid Daliri, and Insu Han. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead.arXiv preprint arXiv:2406.03482, 2024. URLhttps: //arxiv.org/abs/2406.03482
-
[44]
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. InICLR, 2026. URL https://arxiv.org/ abs/2504.19874. A Proof of subtractive-dither unbiasedness This appendix gives a self-contained proof that subtractive-dithered lattice quantization satisfies the Schuchman conditions ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.