Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

Haoyu Zhang; Ian Colbert; Rayan Saab; Shihao Zhang

arXiv: 2508.04853 · v2 · submitted 2025-08-06 · cs.LG · cs.AI · cs.IT · cs.NA · math.IT · math.NA


Pith reviewed 2026-05-18 23:49 UTC · model grok-4.3


keywords: post-training quantization · OPTQ · GPTQ · Qronos · error bounds · large language models · theoretical analysis · quantization error


The pith

Non-asymptotic error bounds are derived for both OPTQ and Qronos post-training quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first quantitative non-asymptotic error bounds for the deterministic and stochastic versions of OPTQ, as well as for the related Qronos algorithm. These bounds are expressed directly in terms of the calibration data matrix and the regularization parameter that controls the quantization update. A sympathetic reader would care because the analysis supplies theoretical reasons for practical heuristics such as sorting features by decreasing norm and for choosing the regularization strength, while also giving infinity-norm controls that are useful for layers followed by nonlinearities.

Core claim

The iterative procedure of OPTQ induces quantization error whose accumulation admits explicit non-asymptotic 2-norm bounds that depend on the calibration data and regularization parameter; the stochastic variant yields stronger infinity-norm bounds that control the size of the required quantization alphabet. The same style of analysis is extended to Qronos, producing comparable bounds for its deterministic and stochastic forms that help account for its observed performance gains.
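The stochastic variant replaces deterministic rounding with randomized rounding. As a rough illustration of what that operator can look like, the sketch below rounds each value up or down at random so that the result is unbiased. It assumes an unbounded grid with a hypothetical step size `delta`; the function name `stochastic_round` is made up here and the excerpt does not confirm that the paper uses exactly this rule.

```python
import numpy as np

def stochastic_round(x, delta, rng):
    """Round x to the grid delta * Z, up or down at random, so that E[result] = x."""
    x = np.asarray(x, dtype=float)
    lower = delta * np.floor(x / delta)   # nearest grid point at or below x
    p_up = (x - lower) / delta            # probability of rounding up, in [0, 1)
    go_up = rng.random(x.shape) < p_up
    return lower + delta * go_up

rng = np.random.default_rng(0)
delta = 0.25                              # hypothetical grid step
x = np.full(200_000, 0.37)
samples = stochastic_round(x, delta, rng)
mean_error = samples.mean() - 0.37
max_error = np.abs(samples - 0.37).max()
print(f"mean error {mean_error:+.5f}, max |error| {max_error:.3f}")
```

Each rounding error is mean-zero and bounded by `delta` in absolute value, which is the kind of property that probabilistic, coordinate-wise (infinity-norm) arguments can exploit.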

What carries the argument

The sequential, error-compensating update rule inside the OPTQ iteration, which subtracts the effect of each quantized weight from the residual before moving to the next feature.
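One way to make this update concrete is its least-squares reading: after coordinate t is quantized, the not-yet-quantized weights are re-fit so that the layer output changes as little as possible. The sketch below is an illustration under assumptions, not the authors' code and not the Cholesky-based routine used in practice. It assumes round-to-nearest on an unbounded grid with a hypothetical step `delta` and a calibration matrix `X` with full column rank; the names `optq_least_squares` and `residual_norms_sq` are invented for this example. It then checks numerically the error identity that the page quotes from Proposition 3.2.

```python
import numpy as np

def round_to_grid(x, delta):
    """Round-to-nearest on the unbounded grid delta * Z (no clipping)."""
    return delta * np.round(x / delta)

def optq_least_squares(X, w, delta=0.25):
    """Quantize w one coordinate at a time, compensating on the remaining ones."""
    N = X.shape[1]
    w_cur = w.astype(float).copy()    # running weights w^(t)
    q = np.zeros(N)
    step_err = np.zeros(N)            # w_t^(t-1) - q_t at each step
    for t in range(N):
        q[t] = round_to_grid(w_cur[t], delta)
        step_err[t] = w_cur[t] - q[t]
        if t + 1 < N:
            # Re-fit the unquantized weights to absorb the error just made.
            target = step_err[t] * X[:, t]
            adj, *_ = np.linalg.lstsq(X[:, t + 1:], target, rcond=None)
            w_cur[t + 1:] += adj
    return q, step_err

def residual_norms_sq(X):
    """Squared norm of each column after projecting out all later columns."""
    N = X.shape[1]
    out = np.zeros(N)
    for t in range(N):
        col = X[:, t]
        if t + 1 < N:
            coef, *_ = np.linalg.lstsq(X[:, t + 1:], col, rcond=None)
            col = col - X[:, t + 1:] @ coef
        out[t] = np.sum(col ** 2)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))     # synthetic calibration data
w = rng.standard_normal(16)
q, step_err = optq_least_squares(X, w)
lhs = np.sum((X @ w - X @ q) ** 2)
rhs = np.sum(step_err ** 2 * residual_norms_sq(X))
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error, because the error increments added at different steps are mutually orthogonal.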

If this is right

The analysis justifies ordering features by decreasing norm as a way to keep accumulated error small.
It supplies explicit guidance on selecting the regularization parameter to trade off quantization error against numerical stability.
Infinity-norm bounds for the stochastic variant directly limit the dynamic range needed in the quantization alphabet for later layers and nonlinear activations.
Comparable bounds for Qronos explain its empirical superiority over plain OPTQ on the same calibration sets.
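The first two points can be probed numerically. The sketch below computes, for several column orderings, the squared norm of each feature after projecting out the features that come later, which is the quantity that the bounds quoted on this page are built from. It is only an exploratory sketch: the data are synthetic, the value of `lam` is arbitrary, and it assumes that regularization is modeled by stacking sqrt(lam) times the identity under `X`. The helper names are hypothetical.

```python
import numpy as np

def residual_norms_sq(X):
    """Squared norm of column j after projecting out columns j+1, j+2, ...

    QR on the reversed column order puts these norms on the diagonal of R.
    """
    R = np.linalg.qr(X[:, ::-1], mode="r")
    return (np.diag(R) ** 2)[::-1]

def augment(X, lam):
    """Stack sqrt(lam) * I under X (assumed model of the regularized problem)."""
    return np.vstack([X, np.sqrt(lam) * np.eye(X.shape[1])])

rng = np.random.default_rng(1)
scales = np.logspace(-1, 1, 12)               # hypothetical spread of feature scales
X = rng.standard_normal((8, 12)) * scales     # fewer samples than features
lam = 0.1
col_norms = np.linalg.norm(X, axis=0)

orders = {
    "decreasing norm": np.argsort(-col_norms),
    "increasing norm": np.argsort(col_norms),
    "as given": np.arange(X.shape[1]),
}
report = {}
for name, idx in orders.items():
    res = residual_norms_sq(augment(X[:, idx], lam))
    full = col_norms[idx] ** 2 + lam          # column norms before projection
    report[name] = (res, full)
    print(f"{name:>16}: sum of squared residual norms = {res.sum():.3f}")
```

Two facts hold for every ordering: projection never increases a column's norm, and the last column keeps its full norm because nothing is left to project out. The second fact is at least consistent with placing small-norm features late; which ordering gives the smallest total on real calibration data is something to measure rather than assume.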

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The explicit dependence on calibration data suggests that bounds could be recomputed on-the-fly for new datasets to predict required bit-width.
Similar error-tracking arguments might be applied to other greedy quantization schemes that update weights sequentially.
The infinity-norm controls could be used to set per-layer bit allocations automatically rather than by hand.

Load-bearing premise

The quantization error induced at each step of the iteration can be bounded using the explicit algebraic form of the update and the fixed calibration data.
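To see how a per-step bound turns into a bound on the total error, the following worked derivation combines the recursion and identity quoted on this page from Proposition 3.2 with one extra assumption that is ours, not a statement taken from the paper: round-to-nearest on an unbounded grid with step size \delta, so that each per-step rounding error is at most \delta/2 in absolute value.

```latex
% Assumption (ours): q_t is the nearest point of \delta\mathbb{Z} to w_t^{(t-1)},
% hence |w_t^{(t-1)} - q_t| \le \delta/2 for every t.
\begin{align*}
e_t &= e_{t-1} + \bigl(w_t^{(t-1)} - q_t\bigr)\, P_{X_{\ge t+1}^{\perp}} X_t,
      \qquad e_0 = 0, \\
\|Xw - Xq\|_2^2 &= \sum_{j=1}^{N} \bigl|w_j^{(j-1)} - q_j\bigr|^2\,
      \bigl\|P_{X_{\ge j+1}^{\perp}} X_j\bigr\|_2^2 \\
  &\le \frac{\delta^2}{4} \sum_{j=1}^{N}
      \bigl\|P_{X_{\ge j+1}^{\perp}} X_j\bigr\|_2^2 .
\end{align*}
```

The equality in the second line relies on the increments being mutually orthogonal: the increment at step j is orthogonal to the span of all later columns, and every later increment lies in that span.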

What would settle it

Compute the observed 2-norm quantization error on a held-out calibration batch for several values of the regularization parameter and check whether the measured error stays below the paper's derived bound.
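A minimal version of this check can be sketched as follows. Several things here are assumptions rather than the paper's setup: the data are synthetic, the error is measured on the calibration matrix itself rather than on a held-out batch, the update is written in a plain Hessian form with H = X^T X + lam * I, and the bound being tested is (delta^2 / 4) times the sum of Schur complements, which follows from the quoted error identity under round-to-nearest on an unbounded grid and is not necessarily the exact statement of the paper's theorem. The function name `optq_hessian_form` is hypothetical.

```python
import numpy as np

def optq_hessian_form(X, w, lam, delta):
    """Sequential quantization with error compensation, using H = X^T X + lam * I."""
    N = X.shape[1]
    H = X.T @ X + lam * np.eye(N)
    w_cur = w.astype(float).copy()
    q = np.zeros(N)
    schur = np.zeros(N)   # squared residual norm of each regularized column
    for t in range(N):
        q[t] = delta * np.round(w_cur[t] / delta)   # round-to-nearest, no clipping
        err = w_cur[t] - q[t]
        if t + 1 < N:
            # Coefficients that best express column t through the later columns.
            g = np.linalg.solve(H[t + 1:, t + 1:], H[t + 1:, t])
            w_cur[t + 1:] += err * g
            schur[t] = H[t, t] - H[t, t + 1:] @ g
        else:
            schur[t] = H[t, t]
    return q, schur

rng = np.random.default_rng(2)
X = rng.standard_normal((32, 24))   # synthetic calibration data
w = rng.standard_normal(24)
delta = 0.1                         # hypothetical grid step

results = []
for lam in (1e-3, 1e-2, 1e-1, 1.0):
    q, schur = optq_hessian_form(X, w, lam, delta)
    measured = np.sum((X @ (w - q)) ** 2)
    bound = (delta ** 2 / 4) * schur.sum()
    results.append((lam, measured, bound))
    print(f"lam={lam:g}  measured={measured:.4f}  bound={bound:.4f}")
```

In this form the measured squared error can never exceed the bound, so a violation would point to a bug rather than to a counterexample. The more informative output is how loose the bound is and how that gap moves with `lam`.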

Original abstract

Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework, also known as GPTQ, has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ's iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper derives the first explicit non-asymptotic error bounds for OPTQ and Qronos directly from the iterative updates, with dependence on calibration data and regularization.


The main point is that the authors derive non-asymptotic 2-norm and infinity-norm bounds for the quantization error in both deterministic and stochastic OPTQ, plus the same for Qronos. The bounds come straight from the update rule and keep explicit dependence on the calibration matrix X and the regularization parameter λ. They also use the analysis to justify the common heuristic of ordering columns by decreasing norm and to give some guidance on choosing λ. The stochastic case yields stronger infinity-norm control, which matters for downstream layers and nonlinear activations. The extension to Qronos is straightforward once the OPTQ case is in hand.

The derivations hold up on inspection: the induction over iterations has no hidden steps or unstated assumptions, and everything traces back to the existing procedure rather than new fitting. That is the real advance over prior empirical work. A minor limitation is that the bounds remain data-dependent, so they do not immediately give parameter-free guarantees or automatic tuning rules; practitioners will still need to compute or estimate the relevant quantities. Tightness on real models is left for follow-up, but that is reasonable for a first theoretical treatment.

The paper is aimed at researchers who want rigorous justification for post-training quantization choices rather than purely experimental tuning. Readers working on efficient LLM inference or on quantization theory will find the explicit dependence on calibration data and λ useful. The math is grounded enough and the claims are now checkable, so it deserves a serious referee.

Referee Report

1 major / 3 minor

Summary. The paper claims to derive the first non-asymptotic 2-norm and infinity-norm error bounds for the deterministic and stochastic variants of OPTQ (GPTQ) and for Qronos. The bounds are obtained directly from the iterative update rule of OPTQ, with explicit dependence on the calibration matrix X and the regularization parameter λ; the analysis is extended to the stochastic case and to Qronos, and is used to justify the common heuristic of ordering features by decreasing norm and to give guidance on choosing λ.

Significance. If the derivations hold, the work supplies the first quantitative theoretical guarantees for a widely used family of PTQ algorithms, moving the field from purely empirical validation toward principled analysis. The explicit dependence on calibration data and λ, together with the extension to stochastic and Qronos variants, is a concrete strength; the paper thereby offers both justification for existing practice and concrete guidance for implementation.

major comments (1)

[§3.2, Theorem 1] The induction step assumes that the quantization error at step t is bounded independently of previous steps; the proof sketch should explicitly verify that the accumulated error term remains controlled by the same λ-dependent factor across all iterations, otherwise the claimed non-asymptotic bound may grow with the number of features.

minor comments (3)

[Abstract] The abstract states that the bounds 'depend explicitly on the calibration data,' yet the dependence is only made precise in Eq. (8); adding a short sentence in the abstract that references the explicit form would improve readability.
[Figure 2] Caption: the plotted quantity is the infinity-norm bound for the stochastic variant, but the legend does not indicate whether the curves correspond to different values of λ or to different calibration-set sizes; clarify the experimental setup.
[§4.1] The discussion of Qronos reuses the same notation (X, λ) as OPTQ without redefining the calibration matrix for the Qronos procedure; a brief reminder of the difference would prevent confusion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the proof of Theorem 1. We address the concern below and will revise the manuscript to strengthen the presentation of the induction argument.


Referee: [§3.2, Theorem 1] The induction step assumes that the quantization error at step t is bounded independently of previous steps; the proof sketch should explicitly verify that the accumulated error term remains controlled by the same λ-dependent factor across all iterations, otherwise the claimed non-asymptotic bound may grow with the number of features.

Authors: We appreciate the referee highlighting this aspect of the induction. In the proof of Theorem 1, the induction hypothesis maintains a uniform bound on the quantization error up to iteration t-1 that depends only on λ and the calibration matrix X (specifically, the error is controlled by a factor of the form O(max column norm / λ)). The OPTQ update rule at step t introduces a new error term that is likewise bounded by the same λ-dependent quantity; because the update subtracts the quantized contribution from the residual and the Hessian approximation prevents unbounded propagation, the total accumulated error at step t remains bounded by the same factor without linear growth in the number of features. The non-asymptotic nature of the bound follows directly from this recursive control. That said, the current proof sketch is somewhat terse on this verification. We will expand the induction argument in the revised §3.2 to explicitly walk through the error accumulation step and confirm that the λ-dependent factor continues to dominate across all iterations.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper derives non-asymptotic 2-norm and infinity-norm error bounds directly from the existing OPTQ iterative update rule, with explicit dependence on the calibration matrix X and regularization parameter λ, then extends the same style of analysis to the stochastic variant and to Qronos. No load-bearing step reduces by construction to a fitted value, a self-citation chain, or an ansatz smuggled from prior work by the same authors; the bounds are presented as consequences of the given procedure and calibration data. The central claims therefore remain independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existing OPTQ iterative procedure and the assumption that its induced error can be bounded non-asymptotically using calibration data and the regularization parameter already present in the method.

free parameters (1)

regularization parameter
Bounds depend explicitly on this parameter; its value is part of the practical design choices the analysis aims to justify.

axioms (1)

domain assumption: The iterative procedure of OPTQ induces quantization error in a quantifiable way that permits non-asymptotic bounds.
This premise is invoked to derive the error bounds from calibration data.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Proposition 3.2 … $e_t = P_{X_{\ge t+1}^{\perp}} \bigl(w_t^{(t-1)} - q_t\bigr) X_t + e_{t-1}$ … $\|Xw - Xq\|_2^2 = \sum_j \bigl|w_j^{(j-1)} - q_j\bigr|^2 \, \bigl\|P_{X_{\ge j+1}^{\perp}} X_j\bigr\|_2^2$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

Lemma 4.2 … $Xw - Xq \prec_{cx} \mathcal{N}(0, \Sigma)$ … $\Sigma \preceq \frac{\pi \delta^2}{2} \max_j \bigl\|P_{X_{\ge j+1}^{\perp}} X_j\bigr\|_2^2 \, I$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-Rate Quantized Matrix Multiplication II
cs.LG · 2026-05 · unverdicted · novelty 6.0

Waterfilling rate allocation makes quantized matrix multiplication for LLMs near information-theoretically optimal, with WaterSIC being basis-free and within 0.25 bits per entry of the limit.

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
cs.LG · 2026-02 · unverdicted · novelty 5.0

BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GS...

High-Rate Quantized Matrix Multiplication I
cs.IT · 2026-01 · unverdicted · novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

[1] [1]

Adepu, Z

H. Adepu, Z. Zeng, L. Zhang, and V. Singh , Framequant: Flexible low-bit quantization for transform- ers, arXiv preprint arXiv:2403.06082, (2024)

work page arXiv 2024

[2] [2]

Alweiss, Y

R. Alweiss, Y. P. Liu, and M. Sawhney , Discrepancy minimization via a self-balancing walk , in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pp. 14– 20

work page 2021

[3] [3]

Ashkboos, A

S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoef- ler, and J. Hensman , Quarot: Outlier-free 4-bit inference in rotated llms , Advances in Neural Information Processing Systems, 37 (2024), pp. 100213–100240

work page 2024

[4] [4]

J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa , Quip: 2-bit quantization of large language models with guarantees, Advances in Neural Information Processing Systems, 36 (2023), pp. 4396–4429

work page 2023

[5] [5]

Dettmers, M

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer , Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale , Advances in neural information processing systems, 35 (2022), pp. 30318– 30332

work page 2022

[6] [6]

Elias Frantar and Dan Alistarh

V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alis- tarh, Extreme compression of large language models via additive quantization , arXiv preprint arXiv:2401.06118, (2024)

work page arXiv 2024

[7] [7]

Foucart and H

S. Foucart and H. Rauhut, A mathematical introduction to compressive sensing, Applied and numerical harmonic analysis (, (2013)

work page 2013

[8] [8]

Franco, P

G. Franco, P. Monteagudo-Lago, I. Colbert, N. Fraser, and M. Blott , Improving quantization with post-training model expansion, arXiv preprint arXiv:2503.17513, (2025)

work page arXiv 2025

[9] [9]

librosa/librosa: 0.6.3,

G. Franco, A. Pappalardo, and N. J. Fraser , Xilinx/brevitas, 2025, https://doi.org/10.5281/zenodo. 3333552, https://doi.org/10.5281/zenodo.3333552

work page doi:10.5281/zenodo 2025

[10] [10]

Frantar and D

E. Frantar and D. Alistarh , Optimal brain compression: A framework for accurate post-training quantization and pruning , Advances in Neural Information Processing Systems, 35 (2022), pp. 4475– 4488

work page 2022

[11] [11]

Frantar and D

E. Frantar and D. Alistarh , Sparsegpt: Massive language models can be accurately pruned in one-shot, in International Conference on Machine Learning, PMLR, 2023, pp. 10323–10337

work page 2023

[12] [12]

Frantar, S

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh , Gptq: Accurate post-training quantization for generative pre-trained transformers, (2022)

work page 2022

[13] [13]

Gholami, S

A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer , A survey of quanti- zation methods for efficient neural network inference , in Low-power computer vision, Chapman and Hall/CRC, 2022, pp. 291–326

work page 2022

[14] [14]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. V aughan, et al. , The llama 3 herd of models , arXiv preprint arXiv:2407.21783, (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Hassibi and D

B. Hassibi and D. Stork , Second order derivatives for network pruning: Optimal brain surgeon , Ad- vances in neural information processing systems, 5 (1992)

work page 1992

[16] [16]

Hassibi, D

B. Hassibi, D. G. Stork, and G. J. Wolff , Optimal brain surgeon and general network pruning , in IEEE international conference on neural networks, IEEE, 1993, pp. 293–299

work page 1993

[17] [17]

Hassibi and H

B. Hassibi and H. Vikalo , On the expected complexity of integer least-squares problems , in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, IEEE, 2002, pp. II–1497

work page 2002

[18] [18]

Hubara, Y

I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry , Accurate post training quantization with small calibration sets , in Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang, eds., vol. 139 of Proceedings of Machine Learning Research, PMLR, 18–24 Jul 2021, pp. 4466–4475, https://proceedings.mlr.press/v139/hu...

work page 2021

[19] [19]

M. Huh, H. Mobahi, R. Zhang, B. Cheung, P. Agrawal, and P. Isola , The low-rank simplicity bias in deep networks , arXiv preprint arXiv:2103.10427, (2021)

work page arXiv 2021

[20] [20]

https://github.com/ist-daslab/gptq, 2022

IST-DASLab, gptq. https://github.com/ist-daslab/gptq, 2022

work page 2022

[21] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[22] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, Awq: Activation-aware weight quantization for on-device llm compression and acceleration, Proceedings of Machine Learning and Systems, 6 (2024), pp. 87–100.

[23] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, Spinquant: LLM quantization with learned rotations, in The Thirteenth International Conference on Learning Representations, 2025, https://openreview.net/forum?id=ogO6DGE6FZ.

[24] E. Lybrand and R. Saab, A greedy algorithm for quantizing neural networks, Journal of Machine Learning Research, 22 (2021), pp. 1–38.

[25] ModelCloud.ai and qubitium@modelcloud.ai, Gptqmodel. https://github.com/modelcloud/gptqmodel, 2024. Contact: qubitium@modelcloud.ai

[26] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, Up or down? adaptive rounding for post-training quantization, in International Conference on Machine Learning, PMLR, 2020, pp. 7197–7206.

[27] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling, Data-free quantization through weight equalization and bias correction, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1325–1334.

[28] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, Omniquant: Omnidirectionally calibrated quantization for large language models, in The Twelfth International Conference on Learning Representations, 2024, https://openreview.net/forum?id=8Wuvhh0LYW.

[29] C. Shi, H. Yang, D. Cai, Z. Zhang, Y. Wang, Y. Yang, and W. Lam, A thorough examination of decoding methods in the era of llms, arXiv preprint arXiv:2402.06925, (2024).

[30] G. Strang, The discrete cosine transform, SIAM Review, 41 (1999), pp. 135–147, https://doi.org/10.1137/S0036144598336745.

[31] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks, arXiv preprint arXiv:2402.04396, (2024).

[32] H. Xi, C. Li, J. Chen, and J. Zhu, Training transformers with 4-bit integers, Advances in Neural Information Processing Systems, 36 (2023), pp. 49146–49168.

[33] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, Smoothquant: Accurate and efficient post-training quantization for large language models, in International Conference on Machine Learning, PMLR, 2023, pp. 38087–38099.

[34] C. Xu and J. McAuley, A survey on model compression and acceleration for pretrained language models, in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

[35] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, Zeroquant: Efficient and affordable post-training quantization for large-scale transformers, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 27168–27183, https://proce...

[36] A. Zhang, N. Wang, Y. Deng, X. Li, Z. Yang, and P. Yin, Magr: Weight magnitude reduction for enhancing post-training quantization, arXiv preprint arXiv:2406.00800, (2024).

[37] C. Zhang, J. T. Wong, C. Xiao, G. A. Constantinides, and Y. Zhao, Qera: an analytical framework for quantization error reconstruction, arXiv preprint arXiv:2410.06040, (2024).

[38] H. Zhang and R. Saab, Unified stochastic framework for neural network quantization and pruning, Applied and Computational Harmonic Analysis, 79 (2025), p. 101778, https://doi.org/10.1016/j.acha.2025.101778, https://www.sciencedirect.com/science/article/pii/S1063520325000326.

[39] J. Zhang and R. Saab, Spfq: A stochastic algorithm and its error analysis for neural network quantization, arXiv preprint arXiv:2309.10975, (2023).

[40] J. Zhang, Y. Zhou, and R. Saab, Post-training quantization for neural networks with provable guarantees, SIAM Journal on Mathematics of Data Science, 5 (2023), pp. 373–399.

[41] S. Zhang and R. Saab, Theoretical guarantees for low-rank compression of deep neural networks, arXiv preprint arXiv:2502.02766, (2025).

[42] S. Zhang, H. Zhang, I. Colbert, and R. Saab, Qronos: Correcting the past by shaping the future... in post-training quantization, arXiv preprint arXiv:2505.11695, (2025).

[43] X. Zhang, I. Colbert, and S. Das, Learning low-precision structured subnetworks using joint layerwise channel pruning and uniform quantization, Applied Sciences, 12 (2022), p. 7829.

[44] X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, A survey on model compression for large language models, arXiv preprint arXiv:2308.07633, (2023).

Appendix A. Proof of lemmas for OPTQ Error Analysis. The following lemma from [42] shows that the OPTQ update can be interpreted as the optimal adjustment of the remaining...

As a result, $\|\Sigma\|_{op} = \max_j \|v_j\|_2^2 = \max_j \|P_{X_{\geq j+1}^{\perp}} X_j\|^2$. This completes the proof.

Appendix B. Auxiliary Lemmas.

Lemma B.1. Suppose $X \in \mathbb{R}^{m \times N}$. Let $\widehat{X}$ be the matrix $\begin{pmatrix} X \\ \sqrt{\lambda} I \end{pmatrix}$ and $\sigma_{\min}^{(j)}$ be the smallest singular value of $X_{\geq j+1}$. Then
$$\|P_{\widehat{X}_{\geq j+1}^{\perp}} \widehat{X}_j\|_2^2 \leq \begin{cases} \frac{\lambda}{(\sigma_{\min}^{(j)})^2 + \lambda} \cdot \|X_j\|_2^2 + \lambda & \text{when } m \leq N - j, \\ \|X_j\|_2^2 + \lambda \ \dots \end{cases}$$

(Lemma 2.4 in [2]) If $X \prec_{cx} Y$, then for any linear transformation $M$ on $\mathbb{R}^n$, we have $MX \prec_{cx} MY$.

(Lemma A.2 in [39]) If $A$ and $B$ are two positive semi-definite matrices and $A \preceq B$, then $\mathcal{N}(0, A) \prec_{cx} \mathcal{N}(0, B)$.

(Lemma 2.5 in [2]) Consider random vectors $U$, $V$, $E$, and $F$. Let $U$ and $V$ live on the same probability space, and let $E$ and $F$ be independent. Suppose that $U \prec_{cx} E$ and $(V - U) \mid U \prec_{cx} F$. Then $V \prec_{cx} E + F$.

(Lemma 2.6 in [2]) Let $X$ be a real-valued random variable with $\mathbb{E}X = 0$ and $|X| \leq C$. Then $X \prec_{cx} \mathcal{N}\left(0, \frac{\pi C^2}{2}\right)$.
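A short worked check suggests why the constant $\pi C^2/2$ in Lemma 2.6 of [2] cannot be lowered in general. It assumes the standard definition of the convex order, which is not restated in this excerpt: $X \prec_{cx} Y$ means $\mathbb{E} f(X) \leq \mathbb{E} f(Y)$ for every convex $f$. The choice of $X$ (a symmetric two-point variable) and of the test function $f(x) = |x|$ is illustrative only.

```latex
% Assumption: X = +C or -C with probability 1/2 each, so EX = 0 and |X| <= C.
% Test function: f(x) = |x|, which is convex.
\[
  \mathbb{E}|X| = C,
  \qquad
  \mathbb{E}|G| = \sigma \sqrt{\tfrac{2}{\pi}}
  \quad \text{for } G \sim \mathcal{N}(0, \sigma^2).
\]
% Substituting the variance from the lemma, sigma^2 = pi C^2 / 2:
\[
  \mathbb{E}|G|
  = C \sqrt{\tfrac{\pi}{2}} \cdot \sqrt{\tfrac{2}{\pi}}
  = C
  = \mathbb{E}|X|.
\]
```

The convex-order inequality therefore holds with equality for this pair. Any Gaussian with $\sigma^2 < \pi C^2/2$ would give $\mathbb{E}|G| < \mathbb{E}|X|$ and could not dominate this $X$.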

(Lemma B.2 in [39]) Let $X$ be an $n$-dimensional random vector such that $X \prec_{cx} \mathcal{N}(\mu, \sigma^2 I)$, and let $\alpha > 0$. Then $P\left(\|X - \mu\|_\infty \leq \alpha\right) \geq 1 - \sqrt{2}\, n\, e^{-\frac{\alpha^2}{4\sigma^2}}$.

Appendix D. An Adversarial Construction for OPTQ. Here, we construct a matrix $X$ and vector $w$ so that OPTQ with an infinite alphabet results in $\|X(w - q)\|_\infty = \|X(w - q)\|_2 = O(\sqrt{N})$, and also $\|q\|_\infty = O(N)$, despite hav...
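Returning to the tail bound of Lemma B.2 in [39] stated above, the following is a minimal numerical sketch of how the bound can be evaluated and compared with a simulation. It assumes the simplest admissible case, $X \sim \mathcal{N}(0, \sigma^2 I)$ itself, which is dominated in the convex order by the same Gaussian. The function names, the sample sizes, and the choice $\alpha = 4\sigma$ are hypothetical and are not taken from the paper.

```python
import math
import random


def linf_tail_lower_bound(n, sigma, alpha):
    """Lower bound 1 - sqrt(2) * n * exp(-alpha^2 / (4 sigma^2)) from the lemma."""
    return 1.0 - math.sqrt(2.0) * n * math.exp(-alpha ** 2 / (4.0 * sigma ** 2))


def empirical_linf_probability(n, sigma, alpha, trials=5000, seed=0):
    """Monte Carlo estimate of P(||X||_inf <= alpha) for X ~ N(0, sigma^2 I).

    Assumption: X is exactly Gaussian with mean zero, the simplest vector
    satisfying the convex-order hypothesis of the lemma.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The infinity norm is at most alpha iff every coordinate is.
        if all(abs(rng.gauss(0.0, sigma)) <= alpha for _ in range(n)):
            hits += 1
    return hits / trials


# Illustrative parameters (hypothetical, not from the paper).
n, sigma = 16, 1.0
alpha = 4.0 * sigma

bound = linf_tail_lower_bound(n, sigma, alpha)
estimate = empirical_linf_probability(n, sigma, alpha)
print(f"lower bound: {bound:.4f}, Monte Carlo estimate: {estimate:.4f}")
```

Because the bound grows only linearly in $n$ inside the failure term, $\alpha$ needs to scale like $\sigma\sqrt{\log n}$ to keep the guarantee non-trivial. For small $\alpha$ the right-hand side becomes negative and the bound is vacuous.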