
arxiv: 2508.04853 · v2 · submitted 2025-08-06 · cs.LG · cs.AI · cs.IT · cs.NA · math.IT · math.NA

Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

Pith reviewed 2026-05-18 23:49 UTC · model grok-4.3

keywords: post-training quantization, OPTQ, GPTQ, Qronos, error bounds, large language models, theoretical analysis, quantization error

The pith

Non-asymptotic error bounds are derived for both OPTQ and Qronos post-training quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first quantitative non-asymptotic error bounds for the deterministic and stochastic versions of OPTQ, as well as for the related Qronos algorithm. These bounds are expressed directly in terms of the calibration data matrix and the regularization parameter that controls the quantization update. A sympathetic reader would care because the analysis supplies theoretical reasons for practical heuristics such as sorting features by decreasing norm and for choosing the regularization strength, while also giving infinity-norm controls that are useful for layers followed by nonlinearities.

Core claim

The iterative procedure of OPTQ induces quantization error whose accumulation admits explicit non-asymptotic 2-norm bounds that depend on the calibration data and regularization parameter; the stochastic variant yields stronger infinity-norm bounds that control the size of the required quantization alphabet. The same style of analysis is extended to Qronos, producing comparable bounds for its deterministic and stochastic forms that help account for its observed performance gains.
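The stochastic variant mentioned above can be pictured with a small sketch. It assumes that "stochastic" means unbiased stochastic rounding onto a uniform grid, which is the usual meaning but is not spelled out on this page; the function name `stochastic_round` and the `step` parameter are illustrative, not taken from the paper.

```python
import numpy as np

def stochastic_round(x, step=1.0, rng=None):
    """Round each entry of x to a neighbouring multiple of `step` (assumed uniform grid).

    The entry is rounded up with probability equal to its fractional position
    between the two neighbouring grid points, so the result is unbiased.
    """
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x, dtype=float) / step
    lower = np.floor(scaled)
    frac = scaled - lower                      # in [0, 1)
    round_up = rng.random(scaled.shape) < frac # True with probability frac
    return step * (lower + round_up)
```

Because each rounding error has zero mean and is bounded by one grid step, errors of this kind are the sort of quantity that concentration arguments can control entrywise, which is consistent with the infinity-norm flavor of the stochastic bounds described here.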

What carries the argument

The sequential, error-compensating update rule inside the OPTQ iteration, which subtracts the effect of each quantized weight from the residual before moving to the next feature.
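A minimal sketch of such an error-compensating iteration is given here. It follows the least-squares reading quoted in the appendix excerpt further down (each step optimally adjusts the remaining weights) and the augmented matrix formed by stacking X on top of √λ·I. The function name `optq_sketch`, the round-to-nearest quantizer on an unbounded uniform grid, and the use of `numpy.linalg.lstsq` are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def optq_sketch(X, w, lam=0.0, step=1.0):
    """Quantize w one coordinate at a time, compensating error in later coordinates.

    X: (m, N) calibration matrix, w: (N,) weights, lam: regularization (assumed >= 0).
    """
    m, N = X.shape
    # Augmented matrix [X; sqrt(lam) I], so X_hat.T @ X_hat = X.T @ X + lam * I.
    X_hat = np.vstack([X, np.sqrt(lam) * np.eye(N)])
    w_cur = w.astype(float).copy()
    q = np.zeros(N)
    for t in range(N):
        # Round the current (already adjusted) weight to the nearest grid point.
        q[t] = step * np.round(w_cur[t] / step)
        if t + 1 < N:
            # Error this step introduces, expressed through feature t.
            residual = X_hat[:, t] * (w_cur[t] - q[t])
            # Absorb as much of it as possible into the not-yet-quantized weights.
            delta, *_ = np.linalg.lstsq(X_hat[:, t + 1:], residual, rcond=None)
            w_cur[t + 1:] += delta
    return q
```

With two identical features and w = (0.4, 0.4), plain rounding gives (0, 0), while this sketch gives (0, 1): the error from the first coordinate is pushed onto the second before it is rounded.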

If this is right

  • The analysis justifies ordering features by decreasing norm as a way to keep accumulated error small.
  • It supplies explicit guidance on selecting the regularization parameter to trade off quantization error against numerical stability.
  • Infinity-norm bounds for the stochastic variant directly limit the dynamic range needed in the quantization alphabet for later layers and nonlinear activations.
  • Comparable bounds for Qronos explain its empirical superiority over plain OPTQ on the same calibration sets.
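The norm-ordering heuristic in the first bullet can be sketched as a thin wrapper around any sequential quantizer. The names `quantize_in_norm_order` and `quantize_fn` are hypothetical; `quantize_fn(X, w)` stands for whatever column-by-column routine is in use and is assumed to return one quantized value per column.

```python
import numpy as np

def quantize_in_norm_order(X, w, quantize_fn):
    """Run a sequential quantizer after sorting features by decreasing column norm."""
    norms = np.linalg.norm(X, axis=0)
    order = np.argsort(-norms, kind="stable")   # largest-norm feature first
    q_sorted = quantize_fn(X[:, order], w[order])
    q = np.empty_like(q_sorted)
    q[order] = q_sorted                          # undo the permutation
    return q, order
```

Returning `order` alongside `q` makes it easy to check which features were processed first when comparing against an unsorted run.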

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit dependence on calibration data suggests that bounds could be recomputed on-the-fly for new datasets to predict required bit-width.
  • Similar error-tracking arguments might be applied to other greedy quantization schemes that update weights sequentially.
  • The infinity-norm controls could be used to set per-layer bit allocations automatically rather than by hand.

Load-bearing premise

The quantization error induced at each step of the iteration can be bounded using the explicit algebraic form of the update and the fixed calibration data.

What would settle it

Compute the observed 2-norm quantization error on a held-out calibration batch for several values of the regularization parameter and check whether the measured error stays below the paper's derived bound.
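The measurement half of this check could look roughly like the sketch below. It assumes the regularized form H = XᵀX + λI with λ > 0, round-to-nearest on an unbounded grid, and synthetic Gaussian data; `optq_hessian_form` and `sweep_lambda` are illustrative names. The paper's bound itself is not reproduced on this page, so the sketch only reports measured errors, to be compared against the bound taken from the paper.

```python
import numpy as np

def optq_hessian_form(X, w, lam, step=1.0):
    """OPTQ-style sketch using H = X^T X + lam * I (assumes lam > 0)."""
    N = X.shape[1]
    H = X.T @ X + lam * np.eye(N)
    w_cur = w.astype(float).copy()
    q = np.zeros(N)
    for t in range(N):
        q[t] = step * np.round(w_cur[t] / step)
        if t + 1 < N:
            # Least-squares compensation written through the regularized Hessian.
            coeffs = np.linalg.solve(H[t + 1:, t + 1:], H[t + 1:, t])
            w_cur[t + 1:] += coeffs * (w_cur[t] - q[t])
    return q

def sweep_lambda(X_cal, X_held, w, lams, step=1.0):
    """Quantize with X_cal for each lam and measure the 2-norm error on X_held."""
    results = []
    for lam in lams:
        q = optq_hessian_form(X_cal, w, lam, step)
        results.append((lam, float(np.linalg.norm(X_held @ (w - q)))))
    return results

# Synthetic stand-ins for calibration and held-out batches (assumption).
rng = np.random.default_rng(0)
X_cal = rng.standard_normal((64, 16))
X_held = rng.standard_normal((64, 16))
w = rng.uniform(-4, 4, size=16)
for lam, err in sweep_lambda(X_cal, X_held, w, [1e-3, 1e-1, 1.0, 10.0]):
    print(f"lambda={lam:g}  held-out 2-norm error={err:.4f}")
```

As λ grows, the compensation coefficients shrink toward zero and the procedure approaches plain rounding, which is one concrete way to see the trade-off the regularization parameter controls.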

Original abstract

Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework, also known as GPTQ, has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ's iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims to derive the first non-asymptotic 2-norm and infinity-norm error bounds for the deterministic and stochastic variants of OPTQ (GPTQ) and for Qronos. The bounds are obtained directly from the iterative update rule of OPTQ, with explicit dependence on the calibration matrix X and the regularization parameter λ; the analysis is extended to the stochastic case and to Qronos, and is used to justify the common heuristic of ordering features by decreasing norm and to give guidance on choosing λ.

Significance. If the derivations hold, the work supplies the first quantitative theoretical guarantees for a widely used family of PTQ algorithms, moving the field from purely empirical validation toward principled analysis. The explicit dependence on calibration data and λ, together with the extension to stochastic and Qronos variants, is a concrete strength; the paper thereby offers both justification for existing practice and concrete guidance for implementation.

major comments (1)
  1. [§3.2, Theorem 1] §3.2, Theorem 1: the induction step assumes that the quantization error at step t is bounded independently of previous steps; the proof sketch should explicitly verify that the accumulated error term remains controlled by the same λ-dependent factor across all iterations, otherwise the claimed non-asymptotic bound may grow with the number of features.
minor comments (3)
  1. [Abstract] The abstract states that the bounds 'depend explicitly on the calibration data,' yet the dependence is only made precise in Eq. (8); adding a short sentence in the abstract that references the explicit form would improve readability.
  2. [Figure 2] Figure 2 caption: the plotted quantity is the infinity-norm bound for the stochastic variant, but the legend does not indicate whether the curves correspond to different values of λ or different calibration-set sizes; clarify the experimental setup.
  3. [§4.1] §4.1: the discussion of Qronos re-uses the same notation (X, λ) as OPTQ without re-defining the calibration matrix for the Qronos procedure; a brief reminder of the difference would prevent confusion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the proof of Theorem 1. We address the concern below and will revise the manuscript to strengthen the presentation of the induction argument.

Point-by-point responses
  1. Referee: [§3.2, Theorem 1] §3.2, Theorem 1: the induction step assumes that the quantization error at step t is bounded independently of previous steps; the proof sketch should explicitly verify that the accumulated error term remains controlled by the same λ-dependent factor across all iterations, otherwise the claimed non-asymptotic bound may grow with the number of features.

    Authors: We appreciate the referee highlighting this aspect of the induction. In the proof of Theorem 1, the induction hypothesis maintains a uniform bound on the quantization error up to iteration t-1 that depends only on λ and the calibration matrix X (specifically, the error is controlled by a factor of the form O(max column norm / λ)). The OPTQ update rule at step t introduces a new error term that is likewise bounded by the same λ-dependent quantity; because the update subtracts the quantized contribution from the residual and the Hessian approximation prevents unbounded propagation, the total accumulated error at step t remains bounded by the same factor without linear growth in the number of features. The non-asymptotic nature of the bound follows directly from this recursive control. That said, the current proof sketch is somewhat terse on this verification. We will expand the induction argument in the revised §3.2 to explicitly walk through the error accumulation step and confirm that the λ-dependent factor continues to dominate across all iterations.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper derives non-asymptotic 2-norm and infinity-norm error bounds directly from the existing OPTQ iterative update rule, with explicit dependence on the calibration matrix X and regularization parameter λ, then extends the same style of analysis to the stochastic variant and to Qronos. No load-bearing step reduces by construction to a fitted value, a self-citation chain, or an ansatz smuggled from prior work by the same authors; the bounds are presented as consequences of the given procedure and calibration data. The central claims therefore remain independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existing OPTQ iterative procedure and the assumption that its induced error can be bounded non-asymptotically using calibration data and the regularization parameter already present in the method.

free parameters (1)
  • regularization parameter
    Bounds depend explicitly on this parameter; its value is part of the practical design choices the analysis aims to justify.
axioms (1)
  • domain assumption: The iterative procedure of OPTQ induces quantization error in a quantifiable way that permits non-asymptotic bounds.
    This premise is invoked to derive the error bounds from calibration data.

pith-pipeline@v0.9.0 · 5770 in / 1190 out tokens · 74725 ms · 2026-05-18T23:49:02.982359+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. High-Rate Quantized Matrix Multiplication II

    cs.LG 2026-05 unverdicted novelty 6.0

    Waterfilling rate allocation makes quantized matrix multiplication for LLMs near information-theoretically optimal, with WaterSIC being basis-free and within 0.25 bits per entry of the limit.

  2. BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

    cs.LG 2026-02 unverdicted novelty 5.0

    BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GS...

  3. High-Rate Quantized Matrix Multiplication I

    cs.IT 2026-01 unverdicted novelty 5.0

    High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 3 Pith papers · 1 internal anchor

  1. [1]

    H. Adepu, Z. Zeng, L. Zhang, and V. Singh, Framequant: Flexible low-bit quantization for transformers, arXiv preprint arXiv:2403.06082, (2024)

  2. [2]

    R. Alweiss, Y. P. Liu, and M. Sawhney, Discrepancy minimization via a self-balancing walk, in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pp. 14–20

  3. [3]

    S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, Quarot: Outlier-free 4-bit inference in rotated llms, Advances in Neural Information Processing Systems, 37 (2024), pp. 100213–100240

  4. [4]

    J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa, Quip: 2-bit quantization of large language models with guarantees, Advances in Neural Information Processing Systems, 36 (2023), pp. 4396–4429

  5. [5]

    T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, Gpt3.int8(): 8-bit matrix multiplication for transformers at scale, Advances in Neural Information Processing Systems, 35 (2022), pp. 30318–30332

  6. [6]

    V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh, Extreme compression of large language models via additive quantization, arXiv preprint arXiv:2401.06118, (2024)

  7. [7]

    S. Foucart and H. Rauhut, A mathematical introduction to compressive sensing, Applied and Numerical Harmonic Analysis, (2013)

  8. [8]

    G. Franco, P. Monteagudo-Lago, I. Colbert, N. Fraser, and M. Blott, Improving quantization with post-training model expansion, arXiv preprint arXiv:2503.17513, (2025)

  9. [9]

    G. Franco, A. Pappalardo, and N. J. Fraser, Xilinx/brevitas, 2025, https://doi.org/10.5281/zenodo.3333552

  10. [10]

    E. Frantar and D. Alistarh, Optimal brain compression: A framework for accurate post-training quantization and pruning, Advances in Neural Information Processing Systems, 35 (2022), pp. 4475–4488

  11. [11]

    E. Frantar and D. Alistarh, Sparsegpt: Massive language models can be accurately pruned in one-shot, in International Conference on Machine Learning, PMLR, 2023, pp. 10323–10337

  12. [12]

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, Gptq: Accurate post-training quantization for generative pre-trained transformers, (2022)

  13. [13]

    A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, A survey of quantization methods for efficient neural network inference, in Low-power computer vision, Chapman and Hall/CRC, 2022, pp. 291–326

  14. [14]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783, (2024)

  15. [15]

    B. Hassibi and D. Stork, Second order derivatives for network pruning: Optimal brain surgeon, Advances in Neural Information Processing Systems, 5 (1992)

  16. [16]

    B. Hassibi, D. G. Stork, and G. J. Wolff, Optimal brain surgeon and general network pruning, in IEEE International Conference on Neural Networks, IEEE, 1993, pp. 293–299

  17. [17]

    B. Hassibi and H. Vikalo, On the expected complexity of integer least-squares problems, in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, IEEE, 2002, pp. II–1497

  18. [18]

    I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry, Accurate post training quantization with small calibration sets, in Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang, eds., vol. 139 of Proceedings of Machine Learning Research, PMLR, 18–24 Jul 2021, pp. 4466–4475, https://proceedings.mlr.press/v139/hu...

  19. [19]

    M. Huh, H. Mobahi, R. Zhang, B. Cheung, P. Agrawal, and P. Isola, The low-rank simplicity bias in deep networks, arXiv preprint arXiv:2103.10427, (2021)

  20. [20]

    IST-DASLab, gptq. https://github.com/ist-daslab/gptq, 2022

  21. [21]

    B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  22. [22]

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, Awq: Activation-aware weight quantization for on-device llm compression and acceleration, Proceedings of Machine Learning and Systems, 6 (2024), pp. 87–100

  23. [23]

    Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, Spinquant: LLM quantization with learned rotations, in The Thirteenth International Conference on Learning Representations, 2025, https://openreview.net/forum?id=ogO6DGE6FZ

  24. [24]

    E. Lybrand and R. Saab, A greedy algorithm for quantizing neural networks, Journal of Machine Learning Research, 22 (2021), pp. 1–38

  25. [25]

    ModelCloud.ai and qubitium@modelcloud.ai, Gptqmodel. https://github.com/modelcloud/gptqmodel, 2024. Contact: qubitium@modelcloud.ai

  26. [26]

    M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, Up or down? adaptive rounding for post-training quantization, in International Conference on Machine Learning, PMLR, 2020, pp. 7197–7206

  27. [27]

    M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling, Data-free quantization through weight equalization and bias correction, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1325–1334

  28. [28]

    W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, Omniquant: Omnidirectionally calibrated quantization for large language models, in The Twelfth International Conference on Learning Representations, 2024, https://openreview.net/forum?id=8Wuvhh0LYW

  29. [29]

    C. Shi, H. Yang, D. Cai, Z. Zhang, Y. Wang, Y. Yang, and W. Lam, A thorough examination of decoding methods in the era of llms, arXiv preprint arXiv:2402.06925, (2024)

  30. [30]

    G. Strang, The discrete cosine transform, SIAM Review, 41 (1999), pp. 135–147, https://doi.org/10.1137/S0036144598336745

  31. [31]

    A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks, arXiv preprint arXiv:2402.04396, (2024)

  32. [32]

    H. Xi, C. Li, J. Chen, and J. Zhu, Training transformers with 4-bit integers, Advances in Neural Information Processing Systems, 36 (2023), pp. 49146–49168

  33. [33]

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, Smoothquant: Accurate and efficient post-training quantization for large language models, in International Conference on Machine Learning, PMLR, 2023, pp. 38087–38099

  34. [34]

    C. Xu and J. McAuley, A survey on model compression and acceleration for pretrained language models, in Proceedings of the AAAI Conference on Artificial Intelligence, 2023

  35. [35]

    Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, Zeroquant: Efficient and affordable post-training quantization for large-scale transformers, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 27168–27183, https://proce...

  36. [36]

    A. Zhang, N. Wang, Y. Deng, X. Li, Z. Yang, and P. Yin, Magr: Weight magnitude reduction for enhancing post-training quantization, arXiv preprint arXiv:2406.00800, (2024)

  37. [37]

    C. Zhang, J. T. Wong, C. Xiao, G. A. Constantinides, and Y. Zhao, Qera: an analytical framework for quantization error reconstruction, arXiv preprint arXiv:2410.06040, (2024)

  38. [38]

    H. Zhang and R. Saab, Unified stochastic framework for neural network quantization and pruning, Applied and Computational Harmonic Analysis, 79 (2025), p. 101778, https://doi.org/10.1016/j.acha.2025.101778, https://www.sciencedirect.com/science/article/pii/S1063520325000326

  39. [39]

    J. Zhang and R. Saab, Spfq: A stochastic algorithm and its error analysis for neural network quantization, arXiv preprint arXiv:2309.10975, (2023)

    (running header of the paper: H. ZHANG, S. ZHANG, I. COLBERT, R. SAAB)

  40. [40]

    J. Zhang, Y. Zhou, and R. Saab, Post-training quantization for neural networks with provable guarantees, SIAM Journal on Mathematics of Data Science, 5 (2023), pp. 373–399

  41. [41]

    S. Zhang and R. Saab, Theoretical guarantees for low-rank compression of deep neural networks, arXiv preprint arXiv:2502.02766, (2025)

  42. [42]

    S. Zhang, H. Zhang, I. Colbert, and R. Saab, Qronos: Correcting the past by shaping the future... in post-training quantization, arXiv preprint arXiv:2505.11695, (2025)

  43. [43]

    X. Zhang, I. Colbert, and S. Das, Learning low-precision structured subnetworks using joint layerwise channel pruning and uniform quantization, Applied Sciences, 12 (2022), p. 7829

  44. [44]

    X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, A survey on model compression for large language models, arXiv preprint arXiv:2308.07633, (2023)

    Appendix A. Proof of lemmas for OPTQ Error Analysis. The following lemma from [42] shows that the OPTQ update can be interpreted as the optimal adjustment of the remaining...

  45. [45]

    As a result, $\|\Sigma\|_{\mathrm{op}} = \max_j \|v_j\|_2^2 = \max_j \|P_{X_{\ge j+1}^{\perp}} X_j\|_2^2$. This completes the proof. Appendix B. Auxiliary Lemmas. Lemma B.1. Suppose $X \in \mathbb{R}^{m \times N}$. Let $\widehat{X}$ be the matrix $\begin{pmatrix} X \\ \sqrt{\lambda}\, I \end{pmatrix}$ and $\sigma^{(j)}_{\min}$ be the smallest singular value of $X_{\ge j+1}$. Then $\|P_{\widehat{X}_{\ge j+1}^{\perp}} \widehat{X}_j\|_2^2 \le \frac{\lambda}{(\sigma^{(j)}_{\min})^2 + \lambda} \cdot \|X_j\|_2^2 + \lambda$ when $m \le N - j$, and $\le \|X_j\|_2^2 + \lambda$...

  46. [46]

    (Lemma 2.4 in [2]) If $X \prec_{cx} Y$, then for any linear transformation $M$ on $\mathbb{R}^n$, we have $MX \prec_{cx} MY$

  47. [47]

    (Lemma A.2 in [39]) If $A$ and $B$ are two positive semi-definite matrices and $A \preceq B$, then $\mathcal{N}(0, A) \prec_{cx} \mathcal{N}(0, B)$

  48. [48]

    (Lemma 2.5 in [2]) Consider random vectors $U$, $V$, $E$, and $F$. Let $U$ and $V$ live on the same probability space, and let $E$ and $F$ be independent. Suppose that $U \prec_{cx} E$ and $(V - U) \mid U \prec_{cx} F$. Then $V \prec_{cx} E + F$

  49. [49]

    (Lemma 2.6 in [2]) Let $X$ be a real-valued random variable with $\mathbb{E}X = 0$ and $|X| \le C$. Then $X \prec_{cx} \mathcal{N}\left(0, \frac{\pi C^2}{2}\right)$

  50. [50]

    (Lemma B.2 in [39]) Let $X$ be an $n$-dimensional random vector such that $X \prec_{cx} \mathcal{N}(\mu, \sigma^2 I)$, and let $\alpha > 0$. Then $\mathbb{P}\left(\|X - \mu\|_\infty \le \alpha\right) \ge 1 - \sqrt{2}\, n\, e^{-\frac{\alpha^2}{4\sigma^2}}$. Appendix D. An Adversarial Construction for OPTQ. Here, we construct a matrix $X$ and vector $w$ so that OPTQ with an infinite alphabet results in $\|X(w - q)\|_\infty = \|X(w - q)\|_2 = O(\sqrt{N})$, and also $\|q\|_\infty = O(N)$, despite hav...