arxiv: 2605.18800 · v1 · pith:R6VLVVKYnew · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Theory-optimal Quantization Based on Flatness

Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: post-training quantization · large language models · activation outliers · flatness metric · bidirectional diagonal quantization · model compression · low-bit inference

The pith

Flatness analysis yields an optimal bidirectional diagonal transformation that disperses LLM activation outliers for low-bit quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose accuracy during post-training quantization mainly because activation outliers produce large rounding errors at low precision. The authors first express quantization error in terms of the distribution of outlier magnitudes, then introduce a Flatness metric that measures how concentrated those large values remain after a linear transform. They derive the closed-form transformation that minimizes Flatness and show it can be realized by learned diagonal matrices applied in both row and column directions. Bidirectional Diagonal Quantization (BDQ) implements this optimum, spreading outlier energy across matrix dimensions so that standard rounding incurs less damage. The reported experiments are consistent with this: less than 1% accuracy drop at W4A4 on LLaMA-3-8B, and a 39.1% reduction of the performance gap versus state-of-the-art methods at W2A4KV16 on DeepSeek-R1-Distill-LLaMA-70B.
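To make the outlier mechanism concrete, here is a minimal NumPy sketch of why a single large value hurts low-bit rounding. It assumes per-tensor symmetric round-to-nearest quantization, which is a common baseline and not necessarily the quantizer used in the paper; the function names `quantize_symmetric` and `demo` are illustrative.

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Per-tensor symmetric uniform quantization (round-to-nearest).

    Assumed baseline quantizer for illustration, not the paper's exact scheme.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax           # one step size for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized values

def demo(seed=0, n=4096, outlier=50.0):
    """Return (MSE without outlier, MSE on the same entries with one outlier)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x_out = x.copy()
    x_out[0] = outlier                         # one outlier stretches the range
    mse_plain = np.mean((x - quantize_symmetric(x)) ** 2)
    # Compare only the ordinary entries so the outlier itself is excluded.
    err = x_out[1:] - quantize_symmetric(x_out)[1:]
    mse_outlier = np.mean(err ** 2)
    return mse_plain, mse_outlier

if __name__ == "__main__":
    print(demo())
```

The outlier sets the step size for every entry, so the ordinary values are rounded on a much coarser grid. Spreading that magnitude across dimensions is what the transformation discussed here is meant to achieve.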

Core claim

The paper argues that quantization error is governed by the distribution of outliers, and quantifies that distribution with a Flatness measure. The linear transformation that minimizes Flatness is, by the paper's derivation, theoretically optimal. BDQ approximates it by applying separate learned diagonal matrices to weights and activations, so that outlier magnitudes are redistributed across dimensions and the effective rounding error decreases.

What carries the argument

The Flatness metric, which quantifies the concentration of outlier magnitudes after transformation, together with the bidirectional diagonal matrices that achieve its theoretical minimum.

If this is right

  • BDQ achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model.
  • BDQ reduces the performance gap by 39.1% compared to state-of-the-art in the W2A4KV16 setting on DeepSeek-R1-Distill-LLaMA-70B.
  • The transformed weights and activations exhibit more dispersed outlier patterns with less concentrated magnitude distributions.
  • The diagonal transformations can be absorbed into the model weights for inference with no extra cost.
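The last point, that diagonal transformations can be folded away, follows from a simple identity: scaling the weight by diagonals on both sides and applying the inverse scalings to the input and output leaves the layer's full-precision result unchanged. The sketch below checks this numerically for V = d1 W d2. The shapes, the random diagonals, and the names `fold_diagonals` and `demo` are illustrative assumptions; where the inverse scalings end up in a real model (for example in neighbouring layers) is not specified here.

```python
import numpy as np

def fold_diagonals(W, a, b):
    """Return V = diag(a) @ W @ diag(b), computed with broadcasting."""
    return a[:, None] * W * b[None, :]

def demo(seed=0):
    """Max absolute difference between X @ W and the rescaled computation."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((8, 16))       # toy activations
    W = rng.standard_normal((16, 32))      # toy weights
    a = rng.uniform(0.5, 2.0, size=16)     # input-side diagonal (placeholder values)
    b = rng.uniform(0.5, 2.0, size=32)     # output-side diagonal (placeholder values)

    V = fold_diagonals(W, a, b)
    X_t = X / a[None, :]                   # X @ diag(a)^-1
    Y = (X_t @ V) / b[None, :]             # undo the output-side scaling
    return np.max(np.abs(Y - X @ W))

if __name__ == "__main__":
    print(demo())
```

In full precision the two computations agree up to floating-point round-off; the transformation only matters once X_t and V are quantized.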

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same flatness minimization could be used to guide quantization of the key-value cache without retraining.
  • The bidirectional construction may generalize to other structured linear transforms such as low-rank adapters.
  • One could test whether the derived optimum remains stable when the outlier statistics shift across different calibration datasets.

Load-bearing premise

The paper's model of how quantization error depends on outliers must be accurate enough that the optimum derived from it is meaningful, and that optimum must be reachable in practice by learned bidirectional diagonal matrix transformations.

What would settle it

Compute the actual quantization error on the LLaMA-3-8B calibration set before and after applying the learned bidirectional diagonal matrices and check whether the reduction matches the amount predicted by the flatness formula.
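A measurement harness for this kind of check could look like the sketch below. It is an assumption-laden toy: the quantizer is per-tensor symmetric round-to-nearest, the layer is random data with one inflated activation channel, and the diagonals passed in are placeholders, not the learned BDQ parameters or the closed-form optimum. The names `fake_quant`, `output_error`, and `make_toy_layer` are hypothetical.

```python
import numpy as np

def fake_quant(x, bits):
    """Quantize then dequantize with one symmetric scale per tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def output_error(X, W, d_in, d_out, bits):
    """Relative output error after quantizing the transformed pair.

    Uses V = diag(d_in) @ W @ diag(d_out) and X_t = X @ diag(d_in)^-1,
    then undoes the output-side scaling before comparing with X @ W.
    """
    Y_ref = X @ W
    X_t = X / d_in[None, :]
    V = d_in[:, None] * W * d_out[None, :]
    Y_q = (fake_quant(X_t, bits) @ fake_quant(V, bits)) / d_out[None, :]
    return np.linalg.norm(Y_ref - Y_q) / np.linalg.norm(Y_ref)

def make_toy_layer(seed=0):
    """Random activations with one outlier channel, plus random weights."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((256, 64))
    X[:, 3] *= 40.0                        # synthetic activation outlier channel
    W = rng.standard_normal((64, 32))
    return X, W

if __name__ == "__main__":
    X, W = make_toy_layer()
    identity_in, identity_out = np.ones(64), np.ones(32)
    print("no transform, 4-bit:", output_error(X, W, identity_in, identity_out, bits=4))
```

To run the check described above, one would replace the toy data with real calibration activations and the identity diagonals with the learned ones, then compare the measured change in error against the change the Flatness formula predicts. Whether the error actually drops depends on the diagonals supplied; the sketch does not demonstrate the paper's result.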

Figures

Figures reproduced from arXiv: 2605.18800 by Dong Li, Emad Barsoum, Kang Liu, Lu Wang, Xiusheng Huang, Xuanwu Yin, Yequan Wang, Zhe Li.

Figure 1: Activation distributions under different transformations for LLaMA3-8B. After quantization, values from …
Figure 2: The transformation results of different methods. The Rotation Matrix is a learnable random Hadamard …

Original abstract

Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric Flatness to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper models the mathematical relationship between quantization error and outliers in LLMs, introduces a Flatness metric to quantify outlier distributions, and derives a theoretical optimum with respect to that metric. It then proposes Bidirectional Diagonal Quantization (BDQ), which uses learned bidirectional diagonal matrix transformations to realize the optimum, and claims state-of-the-art post-training quantization. Key results include less than 1% accuracy drop in W4A4 on LLaMA-3-8B and a 39.1% reduction in the performance gap versus SOTA in the W2A4KV16 setting on DeepSeek-R1-Distill-LLaMA-70B.

Significance. If the derivation holds and the learned transformations realize the claimed optimum, this would offer a principled, theoretically grounded method for outlier mitigation in LLM quantization, moving beyond purely heuristic linear transformations. The reported empirical gains in challenging low-bit regimes indicate potential practical impact for efficient LLM deployment, provided the theory-practice link is substantiated.

major comments (1)
  1. [Method / Theoretical Derivation] The central claim rests on deriving a theoretical optimum w.r.t. Flatness and asserting that BDQ's learned bidirectional diagonal transformations realize it exactly (or closely approximate it). No verification is provided—such as a comparison of the optimized diagonal entries against the closed-form solution or a convergence analysis—leaving open whether the accuracy gains follow from the theory or are empirical. This is load-bearing for the 'theory-optimal' framing and the reported 39.1% gap reduction.
minor comments (2)
  1. [Experiments] Experimental results (e.g., W4A4 on LLaMA-3-8B) are presented without error bars, standard deviations across runs, or ablation studies isolating the bidirectional diagonal components, which would help assess robustness.
  2. [Abstract] The abstract states that a mathematical relationship is modeled and an optimum derived but includes no equations or proof outline; adding a key equation or high-level derivation sketch would aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comment raises an important point about substantiating the connection between our theoretical derivation and the BDQ implementation. We address this below and commit to revisions that will strengthen the manuscript.

point-by-point responses
  1. Referee: [Method / Theoretical Derivation] The central claim rests on deriving a theoretical optimum w.r.t. Flatness and asserting that BDQ's learned bidirectional diagonal transformations realize it exactly (or closely approximate it). No verification is provided—such as a comparison of the optimized diagonal entries against the closed-form solution or a convergence analysis—leaving open whether the accuracy gains follow from the theory or are empirical. This is load-bearing for the 'theory-optimal' framing and the reported 39.1% gap reduction.

    Authors: We appreciate the referee highlighting the need for explicit verification to support the 'theory-optimal' claim. In the manuscript, we first model the quantization error in terms of outlier magnitudes and introduce the Flatness metric to capture the concentration of outliers across dimensions. From this, we derive a closed-form expression for the optimal diagonal transformation that minimizes Flatness. BDQ then realizes this optimum by learning bidirectional diagonal matrices whose entries are optimized to match the derived solution. To directly address the concern, we will add a new subsection (in the revised Section 4) that (i) computes the theoretical optimal diagonal values from the closed-form expression for representative layers, (ii) compares them quantitatively to the learned diagonal entries from BDQ training, and (iii) includes a convergence plot and analysis showing that the optimization procedure converges to the theoretical values. These additions will demonstrate that the reported accuracy improvements, including the 39.1% gap reduction, arise from realizing the derived optimum rather than from heuristic search alone.

    revision: yes

Circularity Check

1 step flagged

Flatness metric defined to quantify outliers then used to derive theory-optimal solution realized by BDQ transformations

specific steps
  1. self-definitional [Abstract]
    "we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric Flatness to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ)... that effectively disperses outlier patterns through optimized matrix transformations."

    Flatness is defined within the paper to measure outlier distribution after the error-outlier modeling step; the theoretical optimum is then derived specifically w.r.t. this internal metric, and BDQ's learned diagonal transformations are presented as achieving that optimum. The 'theory-optimal' label therefore reduces to optimizing the paper's own constructed quantity rather than an externally validated target.

full rationale

The paper models quantization error vs. outliers, introduces Flatness as a new distribution metric, derives a theoretical optimum w.r.t. Flatness, and asserts that learned bidirectional diagonal matrices in BDQ realize this optimum. This chain is self-contained but carries moderate circularity risk because the claimed theory-optimal result is constructed directly from the paper's own newly defined metric and modeling assumptions rather than an independent external benchmark or closed-form result shown to be achieved exactly by the practical method. No self-citations or fitted predictions are load-bearing in the abstract, but the link between derivation and empirical gains (e.g., <1% drop) depends on the transformations converging to the internal optimum by design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the newly introduced Flatness metric as a faithful quantifier of outlier impact and on the assumption that diagonal transformations suffice to achieve the derived optimum. No explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Quantization error can be mathematically related to the distribution of activation outliers in a way that permits derivation of an optimal transformation.
    Stated in the abstract as the basis for modeling and deriving the theoretical solution.
invented entities (1)
  • Flatness metric (no independent evidence)
    purpose: To quantify the distribution and concentration of outliers after linear transformations.
    Newly introduced in the paper to support the theoretical derivation and BDQ design.

pith-pipeline@v0.9.0 · 5780 in / 1350 out tokens · 43973 ms · 2026-05-20T22:34:08.988362+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We propose Flatness F=∑ W_ij²/(α_i β_j) ln(...) ; min F s.t. ∑ W_ij²/(α_i β_j)=1 and energy constraint. Lagrange yields ∂L/∂α_k=0, ∂L/∂β_l=0 implying row independence and column independence, so optimal V=d1 W d2 with diagonal d1=diag(√α_i), d2=diag(√β_j).

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection refines

    Bidirectional Diagonal Quantization (BDQ) ... two learnable diagonal transformation pairs ... theoretically demonstrate that this formulation can achieve the optimal solution with respect to Flatness.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    G Appendix: The Reason for Adding the Rotation Matrix As we mentioned in Section 4.3, we obtained the optimal solution for Flatness, which is V= d1W d2. The motivation for adding the rotation matrix R is to prevent the special case where the matrix W has strong column correlations. The rota- tion matrix can, while retaining the ability of diago- nal scali...