Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos
Pith reviewed 2026-05-18 23:49 UTC · model grok-4.3
The pith
Non-asymptotic error bounds are derived for both OPTQ and Qronos post-training quantization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The iterative procedure of OPTQ induces quantization error whose accumulation admits explicit non-asymptotic 2-norm bounds that depend on the calibration data and regularization parameter; the stochastic variant yields stronger infinity-norm bounds that control the size of the required quantization alphabet. The same style of analysis is extended to Qronos, producing comparable bounds for its deterministic and stochastic forms that help account for its observed performance gains.
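The stochastic variant is not spelled out on this page. A minimal sketch, assuming it replaces round-to-nearest with unbiased stochastic rounding onto a uniform grid of step `delta` (the function name `stochastic_round` and the grid are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def stochastic_round(x, delta, rng):
    """Round each entry of x to a neighbouring point of delta * Z, without bias.

    An entry lying a fraction p of the way from the lower grid point to the
    upper one is rounded up with probability p, so the expected output is x.
    """
    scaled = np.asarray(x, dtype=float) / delta
    lower = np.floor(scaled)
    p_up = scaled - lower                      # fraction in [0, 1)
    go_up = rng.random(scaled.shape) < p_up    # one Bernoulli(p_up) draw per entry
    return delta * (lower + go_up)

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(200_000, 0.3), 1.0, rng)
print(samples.mean())  # empirical mean of values that are each 0.0 or 1.0
```

Under this assumption the rounding error is mean-zero and bounded by `delta` in magnitude, which is the kind of hypothesis a concentration argument for infinity-norm control would need.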
What carries the argument
The sequential, error-compensating update rule inside the OPTQ iteration, which subtracts the effect of each quantized weight from the residual before moving to the next feature.
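The update described above can be sketched in a few lines. This is a hedged illustration that uses a least-squares form of the compensation step (after fixing one quantized weight, shift the remaining weights to best absorb the error it caused). Practical implementations typically work with a Hessian-inverse formulation instead. The names `round_to_grid` and `optq_least_squares`, the grid step `delta`, and the unbounded grid are assumptions made for the example.

```python
import numpy as np

def round_to_grid(x, delta):
    """Round-to-nearest onto the unbounded grid delta * Z (assumed alphabet)."""
    return delta * np.round(x / delta)

def optq_least_squares(X, w, delta):
    """Quantize w one coordinate at a time with error compensation.

    After q[t] is fixed, the not-yet-quantized weights are shifted by the
    least-squares solution that best absorbs the error (w[t] - q[t]) * X[:, t].
    """
    m, N = X.shape
    w = w.astype(float).copy()
    q = np.zeros(N)
    for t in range(N):
        q[t] = round_to_grid(w[t], delta)
        if t + 1 < N:
            err_vec = (w[t] - q[t]) * X[:, t]
            # Best correction of the remaining coordinates in the 2-norm sense.
            shift, *_ = np.linalg.lstsq(X[:, t + 1:], err_vec, rcond=None)
            w[t + 1:] += shift
    return q

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))   # synthetic calibration matrix
w = rng.uniform(-1.0, 1.0, 16)      # synthetic weights
delta = 0.25
q = optq_least_squares(X, w, delta)
err_optq = np.linalg.norm(X @ (w - q))
err_rtn = np.linalg.norm(X @ (w - round_to_grid(w, delta)))
print(err_optq, err_rtn)            # compensated error vs. plain rounding
```

The printed pair compares the compensated error with plain round-to-nearest on the same synthetic data; no particular ordering between the two is guaranteed for every input.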
If this is right
- The analysis justifies ordering features by decreasing norm as a way to keep accumulated error small.
- It supplies explicit guidance on selecting the regularization parameter to trade off quantization error against numerical stability.
- Infinity-norm bounds for the stochastic variant directly limit the dynamic range needed in the quantization alphabet for later layers and nonlinear activations.
- Comparable bounds for Qronos explain its empirical superiority over plain OPTQ on the same calibration sets.
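To make the infinity-norm item concrete, the sketch below inverts a tail bound of the form P(‖X − μ‖∞ ≤ α) ≥ 1 − √2 · n · exp(−α² / (4σ²)), which appears among the auxiliary lemmas quoted in the reference fragments further down this page. The helper name `linf_radius` and the example numbers are assumptions; how the paper converts such a radius into an alphabet size is not reproduced here.

```python
import math

def linf_radius(n, sigma, fail_prob):
    """Smallest alpha with sqrt(2) * n * exp(-alpha**2 / (4 * sigma**2)) <= fail_prob."""
    if n < 1 or sigma <= 0 or not 0 < fail_prob < 1:
        raise ValueError("need n >= 1, sigma > 0 and 0 < fail_prob < 1")
    return 2.0 * sigma * math.sqrt(math.log(math.sqrt(2.0) * n / fail_prob))

# Illustrative numbers only: 4096 coordinates, proxy standard deviation 0.05.
alpha = linf_radius(n=4096, sigma=0.05, fail_prob=1e-3)
print(alpha)
```

Because the radius grows only like the square root of log(n), doubling the number of coordinates changes the required range very little under this bound.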
Where Pith is reading between the lines
- The explicit dependence on calibration data suggests that bounds could be recomputed on-the-fly for new datasets to predict required bit-width.
- Similar error-tracking arguments might be applied to other greedy quantization schemes that update weights sequentially.
- The infinity-norm controls could be used to set per-layer bit allocations automatically rather than by hand.
Load-bearing premise
The quantization error induced at each step of the iteration can be bounded using the explicit algebraic form of the update and the fixed calibration data.
What would settle it
Compute the observed 2-norm quantization error on a held-out calibration batch for several values of the regularization parameter and check whether the measured error stays below the paper's derived bound.
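A hedged sketch of such a check on synthetic data. It assumes the regularized procedure is equivalent to running the sequential update on the stacked matrix formed by X on top of √λ·I, and it compares the measured squared error with two quantities: the sum from the identity quoted from Proposition 3.2 further down this page, and a worst-case value (δ/2)² Σ‖P X_j‖² that follows from that identity under round-to-nearest on an unbounded grid. Neither quantity is claimed to be the paper's final theorem, and the function name `run_and_track` is hypothetical.

```python
import numpy as np

def run_and_track(X, w, delta, lam):
    """Regularized sequential quantization that also records per-step quantities."""
    m, N = X.shape
    X_hat = np.vstack([X, np.sqrt(lam) * np.eye(N)])   # assumed stacked form
    w = w.astype(float).copy()
    q, dev, proj_sq = np.zeros(N), np.zeros(N), np.zeros(N)
    for t in range(N):
        q[t] = delta * np.round(w[t] / delta)
        dev[t] = w[t] - q[t]                            # deviation at step t
        col = X_hat[:, t]
        if t + 1 < N:
            rest = X_hat[:, t + 1:]
            coef, *_ = np.linalg.lstsq(rest, col, rcond=None)
            resid = col - rest @ coef                   # part of col orthogonal to span(rest)
            w[t + 1:] += dev[t] * coef                  # error compensation
        else:
            resid = col
        proj_sq[t] = resid @ resid
    return X_hat, q, dev, proj_sq

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 12))
w0 = rng.uniform(-1.0, 1.0, 12)
delta = 0.25
results = []
for lam in (1e-3, 1e-1, 1.0):
    X_hat, q, dev, proj_sq = run_and_track(X, w0, delta, lam)
    measured = np.linalg.norm(X_hat @ (w0 - q)) ** 2
    identity = float(np.sum(dev ** 2 * proj_sq))
    worst = (delta / 2) ** 2 * float(np.sum(proj_sq))
    results.append((lam, measured, identity, worst))
    print(lam, measured, identity, worst, np.linalg.norm(X @ (w0 - q)) ** 2)
```

The identity is exact only on the matrix used to run the update. Evaluating the final column of the printout on a genuinely held-out batch is the empirical test the paragraph above asks for.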
Original abstract
Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework, also known as GPTQ, has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ's iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.
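The ordering heuristic mentioned in the abstract amounts to a column permutation applied before quantization and undone afterwards. A minimal sketch, assuming "norm" means the Euclidean norm of each column of the calibration matrix (the helper name `order_by_decreasing_norm` is hypothetical):

```python
import numpy as np

def order_by_decreasing_norm(X):
    """Return a permutation sorting columns by decreasing 2-norm, and its inverse."""
    norms = np.linalg.norm(X, axis=0)
    perm = np.argsort(-norms, kind="stable")
    inv = np.empty_like(perm)
    inv[perm] = np.arange(perm.size)   # inv undoes perm
    return perm, inv

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5)) * np.array([0.1, 3.0, 1.0, 0.5, 2.0])
perm, inv = order_by_decreasing_norm(X)
X_sorted = X[:, perm]                  # quantize against this matrix with w[perm]
print(perm)
```

A quantizer would then run on `X_sorted` and `w[perm]`, and the resulting vector `q_sorted` would be mapped back with `q_sorted[inv]`.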
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to derive the first non-asymptotic 2-norm and infinity-norm error bounds for the deterministic and stochastic variants of OPTQ (GPTQ) and for Qronos. The bounds are obtained directly from the iterative update rule of OPTQ, with explicit dependence on the calibration matrix X and the regularization parameter λ; the analysis is extended to the stochastic case and to Qronos, and is used to justify the common heuristic of ordering features by decreasing norm and to give guidance on choosing λ.
Significance. If the derivations hold, the work supplies the first quantitative theoretical guarantees for a widely used family of PTQ algorithms, moving the field from purely empirical validation toward principled analysis. The explicit dependence on calibration data and λ, together with the extension to stochastic and Qronos variants, is a concrete strength; the paper thereby offers both justification for existing practice and concrete guidance for implementation.
major comments (1)
- [§3.2, Theorem 1] The induction step assumes that the quantization error at step t is bounded independently of previous steps. The proof sketch should explicitly verify that the accumulated error term remains controlled by the same λ-dependent factor across all iterations; otherwise the claimed non-asymptotic bound may grow with the number of features.
minor comments (3)
- [Abstract] The abstract states that the bounds 'depend explicitly on the calibration data,' yet the dependence is only made precise in Eq. (8); adding a short sentence in the abstract that references the explicit form would improve readability.
- [Figure 2] Caption: the plotted quantity is the infinity-norm bound for the stochastic variant, but the legend does not indicate whether the curves correspond to different values of λ or different calibration-set sizes; clarify the experimental setup.
- [§4.1] The discussion of Qronos re-uses the same notation (X, λ) as OPTQ without re-defining the calibration matrix for the Qronos procedure; a brief reminder of the difference would prevent confusion.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the constructive comment on the proof of Theorem 1. We address the concern below and will revise the manuscript to strengthen the presentation of the induction argument.
Point-by-point responses
-
Referee: [§3.2, Theorem 1] The induction step assumes that the quantization error at step t is bounded independently of previous steps. The proof sketch should explicitly verify that the accumulated error term remains controlled by the same λ-dependent factor across all iterations; otherwise the claimed non-asymptotic bound may grow with the number of features.
Authors: We appreciate the referee highlighting this aspect of the induction. In the proof of Theorem 1, the induction hypothesis maintains a uniform bound on the quantization error up to iteration t-1 that depends only on λ and the calibration matrix X (specifically, the error is controlled by a factor of the form O(max column norm / λ)). The OPTQ update rule at step t introduces a new error term that is likewise bounded by the same λ-dependent quantity; because the update subtracts the quantized contribution from the residual and the Hessian approximation prevents unbounded propagation, the total accumulated error at step t remains bounded by the same factor without linear growth in the number of features. The non-asymptotic nature of the bound follows directly from this recursive control. That said, the current proof sketch is somewhat terse on this verification. We will expand the induction argument in the revised §3.2 to explicitly walk through the error accumulation step and confirm that the λ-dependent factor continues to dominate across all iterations.
Revision: yes
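One way to see how the per-step errors combine, using only the recursion and identity quoted from Proposition 3.2 further down this page. The shorthand u_j is introduced here for illustration. The sketch assumes e_0 = 0, e_N = Xw − Xq, and, for the final inequality only, round-to-nearest onto an unbounded grid of step δ.

```latex
% Shorthand u_j is an assumption of this sketch, not the paper's notation.
\[
u_j := P_{X_{\ge j+1}^{\perp}} X_j,
\qquad
Xw - Xq = \sum_{j=1}^{N} \bigl(w_j^{(j-1)} - q_j\bigr)\, u_j .
\]
% For j < k: u_k lies in span(X_{\ge k}), which is contained in span(X_{\ge j+1}),
% while u_j is orthogonal to span(X_{\ge j+1}); hence \langle u_j, u_k \rangle = 0.
\[
\|Xw - Xq\|_2^2
  = \sum_{j=1}^{N} \bigl|w_j^{(j-1)} - q_j\bigr|^2 \, \|u_j\|_2^2
  \le \frac{\delta^2}{4} \sum_{j=1}^{N} \|u_j\|_2^2 .
\]
```

Read this way, the step errors add in squared norm rather than compounding multiplicatively. Whether the resulting sum stays controlled by a single λ-dependent factor is exactly what the referee asks the authors to verify.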
Circularity Check
No significant circularity; derivation is self-contained
The paper derives non-asymptotic 2-norm and infinity-norm error bounds directly from the existing OPTQ iterative update rule, with explicit dependence on the calibration matrix X and regularization parameter λ, then extends the same style of analysis to the stochastic variant and to Qronos. No load-bearing step reduces by construction to a fitted value, a self-citation chain, or an ansatz smuggled from prior work by the same authors; the bounds are presented as consequences of the given procedure and calibration data. The central claims therefore remain independent of the paper's own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization parameter
axioms (1)
- domain assumption: The iterative procedure of OPTQ induces quantization error in a quantifiable way that permits non-asymptotic bounds.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Proposition 3.2 … $e_t = P_{X_{\ge t+1}^{\perp}} \bigl(w_t^{(t-1)} - q_t\bigr) X_t + e_{t-1}$ … $\|Xw - Xq\|_2^2 = \sum_j \bigl|w_j^{(j-1)} - q_j\bigr|^2 \, \|P_{X_{\ge j+1}^{\perp}} X_j\|_2^2$
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_strictMono_of_one_lt (tag: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Lemma 4.2 … $Xw - Xq \prec_{cx} \mathcal{N}(0, \Sigma)$ … $\Sigma \preceq \frac{\pi\delta^2}{2} \max_j \|P_{X_{\ge j+1}^{\perp}} X_j\|^2 I$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
High-Rate Quantized Matrix Multiplication II
Waterfilling rate allocation makes quantized matrix multiplication for LLMs near information-theoretically optimal, with WaterSIC being basis-free and within 0.25 bits per entry of the limit.
-
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GS...
-
High-Rate Quantized Matrix Multiplication I
High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
Reference graph
Works this paper leans on
- [1]
-
[2]
R. Alweiss, Y. P. Liu, and M. Sawhney, Discrepancy minimization via a self-balancing walk, in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pp. 14–20
-
[3]
S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, Quarot: Outlier-free 4-bit inference in rotated llms, Advances in Neural Information Processing Systems, 37 (2024), pp. 100213–100240
-
[4]
J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa, Quip: 2-bit quantization of large language models with guarantees, Advances in Neural Information Processing Systems, 36 (2023), pp. 4396–4429
-
[5]
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, Gpt3.int8(): 8-bit matrix multiplication for transformers at scale, Advances in Neural Information Processing Systems, 35 (2022), pp. 30318–30332
-
[6]
V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh, Extreme compression of large language models via additive quantization, arXiv preprint arXiv:2401.06118, (2024)
-
[7]
S. Foucart and H. Rauhut, A mathematical introduction to compressive sensing, Applied and Numerical Harmonic Analysis, (2013)
- [8]
-
[9]
G. Franco, A. Pappalardo, and N. J. Fraser, Xilinx/brevitas, 2025, https://doi.org/10.5281/zenodo.3333552
-
[10]
E. Frantar and D. Alistarh, Optimal brain compression: A framework for accurate post-training quantization and pruning, Advances in Neural Information Processing Systems, 35 (2022), pp. 4475–4488
-
[11]
E. Frantar and D. Alistarh, Sparsegpt: Massive language models can be accurately pruned in one-shot, in International Conference on Machine Learning, PMLR, 2023, pp. 10323–10337
-
[12]
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, Gptq: Accurate post-training quantization for generative pre-trained transformers, (2022)
-
[13]
A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, A survey of quantization methods for efficient neural network inference, in Low-power computer vision, Chapman and Hall/CRC, 2022, pp. 291–326
-
[14]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783, (2024)
-
[15]
B. Hassibi and D. Stork, Second order derivatives for network pruning: Optimal brain surgeon, Advances in Neural Information Processing Systems, 5 (1992)
-
[16]
B. Hassibi, D. G. Stork, and G. J. Wolff, Optimal brain surgeon and general network pruning, in IEEE International Conference on Neural Networks, IEEE, 1993, pp. 293–299
-
[17]
B. Hassibi and H. Vikalo, On the expected complexity of integer least-squares problems, in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, IEEE, 2002, pp. II–1497
-
[18]
I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry, Accurate post training quantization with small calibration sets, in Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang, eds., vol. 139 of Proceedings of Machine Learning Research, PMLR, 18–24 Jul 2021, pp. 4466–4475, https://proceedings.mlr.press/v139/hu...
- [19]
-
[20]
IST-DASLab, gptq. https://github.com/ist-daslab/gptq, 2022
-
[21]
B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
-
[22]
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, Awq: Activation-aware weight quantization for on-device llm compression and acceleration, Proceedings of Machine Learning and Systems, 6 (2024), pp. 87–100
-
[23]
Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, Spinquant: LLM quantization with learned rotations, in The Thirteenth International Conference on Learning Representations, 2025, https://openreview.net/forum?id=ogO6DGE6FZ
-
[24]
E. Lybrand and R. Saab, A greedy algorithm for quantizing neural networks, Journal of Machine Learning Research, 22 (2021), pp. 1–38
-
[25]
ModelCloud.ai and qubitium@modelcloud.ai, Gptqmodel. https://github.com/modelcloud/gptqmodel, 2024. Contact: qubitium@modelcloud.ai
- [26]
- [27]
-
[28]
W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, Omniquant: Omnidirectionally calibrated quantization for large language models, in The Twelfth International Conference on Learning Representations, 2024, https://openreview.net/forum?id=8Wuvhh0LYW
- [29]
-
[30]
G. Strang, The discrete cosine transform, SIAM Review, 41 (1999), pp. 135–147, https://doi.org/10.1137/S0036144598336745
-
[31]
A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks, arXiv preprint arXiv:2402.04396, (2024)
-
[32]
H. Xi, C. Li, J. Chen, and J. Zhu, Training transformers with 4-bit integers, Advances in Neural Information Processing Systems, 36 (2023), pp. 49146–49168
-
[33]
G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, Smoothquant: Accurate and efficient post-training quantization for large language models, in International Conference on Machine Learning, PMLR, 2023, pp. 38087–38099
- [34]
-
[35]
Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, Zeroquant: Efficient and affordable post-training quantization for large-scale transformers, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 27168–27183, https://proce...
- [36]
- [37]
-
[38]
H. Zhang and R. Saab, Unified stochastic framework for neural network quantization and pruning, Applied and Computational Harmonic Analysis, 79 (2025), p. 101778, https://doi.org/10.1016/j.acha.2025.101778, https://www.sciencedirect.com/science/article/pii/S1063520325000326
-
[39]
J. Zhang and R. Saab, Spfq: A stochastic algorithm and its error analysis for neural network quantization, arXiv preprint arXiv:2309.10975, (2023)
Running head of the paper: H. Zhang, S. Zhang, I. Colbert, R. Saab
- [40]
-
[41]
S. Zhang and R. Saab, Theoretical guarantees for low-rank compression of deep neural networks, arXiv preprint arXiv:2502.02766, (2025)
- [42]
- [43]
-
[44]
X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, A survey on model compression for large language models, arXiv preprint arXiv:2308.07633, (2023). Appendix A. Proof of lemmas for OPTQ Error Analysis. The following lemma from [42] shows that the OPTQ update can be interpreted as the optimal adjustment of the remaining...
-
[45]
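As a result, $\|\Sigma\|_{op} = \max_j \|v_j\|_2^2 = \max_j \|P_{X_{\ge j+1}^{\perp}} X_j\|^2$. This completes the proof. Appendix B. Auxiliary Lemmas. Lemma B.1. Suppose $X \in \mathbb{R}^{m \times N}$. Let $\widehat{X}$ be the matrix $\begin{pmatrix} X \\ \sqrt{\lambda} I \end{pmatrix}$ and $\sigma_{\min}^{(j)}$ be the smallest singular value of $X_{\ge j+1}$. Then $\|P_{\widehat{X}_{\ge j+1}^{\perp}} \widehat{X}_j\|_2^2 \le \frac{\lambda}{(\sigma_{\min}^{(j)})^2 + \lambda} \cdot \|X_j\|_2^2 + \lambda$ when $m \le N - j$, $\|X_j\|_2^2 + \lambda$...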
-
[46]
(Lemma 2.4 in [2]) If $X \prec_{cx} Y$, then for any linear transformation $M$ on $\mathbb{R}^n$, we have $MX \prec_{cx} MY$.
-
[47]
(Lemma A.2 in [39]) If $A$ and $B$ are two positive semi-definite matrices and $A \preceq B$, then $\mathcal{N}(0, A) \prec_{cx} \mathcal{N}(0, B)$.
-
[48]
(Lemma 2.5 in [2]) Consider random vectors $U$, $V$, $E$, and $F$. Let $U$ and $V$ live on the same probability space, and let $E$ and $F$ be independent. Suppose that $U \prec_{cx} E$ and $(V - U)\,|\,U \prec_{cx} F$. Then $V \prec_{cx} E + F$.
-
[49]
(Lemma 2.6 in [2]) Let $X$ be a real-valued random variable with $\mathbb{E}X = 0$ and $|X| \le C$. Then $X \prec_{cx} \mathcal{N}\bigl(0, \tfrac{\pi C^2}{2}\bigr)$.
-
[50]
(Lemma B.2 in [39]) Let $X$ be an $n$-dimensional random vector such that $X \prec_{cx} \mathcal{N}(\mu, \sigma^2 I)$, and let $\alpha > 0$. Then $P(\|X - \mu\|_\infty \le \alpha) \ge 1 - \sqrt{2}\, n\, e^{-\alpha^2/(4\sigma^2)}$. Appendix D. An Adversarial Construction for OPTQ. Here, we construct a matrix $X$ and vector $w$ so that OPTQ with an infinite alphabet results in $\|X(w - q)\|_\infty = \|X(w - q)\|_2 = O(\sqrt{N})$, and also $\|q\|_\infty = O(N)$, despite hav...