pith. machine review for the scientific record.

arxiv: 2605.05693 · v2 · submitted 2026-05-07 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Saliency-Aware Regularized Quantization Calibration for Large Language Models

Pith reviewed 2026-05-11 00:52 UTC · model grok-4.3

keywords: post-training quantization · large language models · quantization calibration · regularization · saliency · generalization

The pith

Adding a regularizer to keep quantized weights close to their original floating-point values during calibration reduces generalization error in post-training quantization of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard post-training quantization methods calibrate by minimizing reconstruction error on a small dataset, which can push quantized weights far from the original floating-point values and hurt performance on new data. The paper introduces Regularized Quantization Calibration (RQC) that adds a penalty term to the objective to control this deviation, then extends it to Saliency-Aware RQC (SARQC) that weights the penalty by how important each weight is. This change integrates into existing scale-search and Gram-based pipelines without changing inference speed. Experiments on dense and Mixture-of-Experts models show lower perplexity and higher zero-shot accuracy after the modification.

Core claim

The core claim is that existing calibration objectives based only on empirical reconstruction error over limited calibration data can increase the distance between quantized and original weights, raising generalization risk, and that explicitly adding a regularizer to minimize this distance (further modulated by saliency) produces quantized models with better downstream performance while remaining compatible with prior PTQ techniques.
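
The abstract gives no explicit formula, so the following is only one plausible way to write the three objectives the claim contrasts. It assumes a squared-error reconstruction loss over calibration activations $X$, original weights $W$, quantized weights $\hat W$ restricted to a quantization grid $\mathcal{Q}$, a strength $\lambda \ge 0$, and nonnegative element-wise saliency scores $S_{ij}$. All of these symbols are illustrative assumptions, not notation taken from the paper.

```latex
\begin{align*}
\text{standard PTQ:}\quad
  & \min_{\hat W \in \mathcal{Q}} \; \lVert X W - X \hat W \rVert_F^2 \\
\text{RQC:}\quad
  & \min_{\hat W \in \mathcal{Q}} \; \lVert X W - X \hat W \rVert_F^2
    + \lambda \, \lVert \hat W - W \rVert_F^2 \\
\text{SARQC:}\quad
  & \min_{\hat W \in \mathcal{Q}} \; \lVert X W - X \hat W \rVert_F^2
    + \lambda \sum_{i,j} S_{ij} \, (\hat W_{ij} - W_{ij})^2
\end{align*}
```

In this form, setting $\lambda = 0$ recovers the standard objective, and setting every $S_{ij} = 1$ reduces the saliency-aware penalty to the plain one.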

What carries the argument

Regularized Quantization Calibration (RQC), a framework that augments the standard PTQ loss with an explicit term penalizing the deviation of quantized weights from the original floating-point weights, optionally weighted by saliency.
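
As a rough illustration of how such a penalty could sit inside a scale-search calibrator, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the per-tensor symmetric round-to-nearest quantizer, the clipping-ratio grid, and the squared-deviation penalty are all assumptions chosen for brevity.

```python
import numpy as np


def quantize_rtn(w, scale, bits=4):
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


def regularized_scale_search(w, x, lam=0.0, saliency=None, bits=4, n_grid=50):
    """Grid-search one scale for a weight matrix (hypothetical helper).

    w: (d_in, d_out) floating-point weights; x: (n, d_in) calibration inputs.
    Objective = reconstruction error + lam * saliency-weighted deviation.
    """
    if saliency is None:
        saliency = np.ones_like(w)  # assumed plain squared-L2 penalty
    qmax = 2 ** (bits - 1) - 1
    max_scale = max(np.abs(w).max() / qmax, 1e-12)  # guard against all-zero w
    y_ref = x @ w
    best = None
    for ratio in np.linspace(0.3, 1.0, n_grid):  # candidate clipping ratios
        scale = ratio * max_scale
        w_hat = quantize_rtn(w, scale, bits)
        recon = float(np.mean((x @ w_hat - y_ref) ** 2))
        deviation = float(np.sum(saliency * (w_hat - w) ** 2))
        objective = recon + lam * deviation
        if best is None or objective < best["objective"]:
            best = {
                "scale": scale,
                "w_hat": w_hat,
                "recon": recon,
                "deviation": deviation,
                "objective": objective,
            }
    return best
```

With `lam=0.0` the search reduces to ordinary reconstruction-error scale search. Because both settings search the same grid, the scale chosen with a positive `lam` cannot have a larger deviation than the one chosen with `lam=0.0`, which is the trade-off the framework is described as controlling.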

If this is right

  • Both scale-search-based and Gram-based calibration methods gain from the same unified regularization term.
  • The approach applies to both dense and Mixture-of-Experts architectures without extra inference cost.
  • Perplexity and zero-shot task accuracy improve consistently across tested models.
  • The regularizer can be added to existing PTQ code with minimal changes.
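
One way to see why a single term could serve Gram-based pipelines with few code changes: if the penalty is a squared Frobenius norm (an assumption here, since the abstract does not give the formula), it folds into the Gram matrix as an added diagonal, much as in ridge regression. The sketch below checks that identity numerically. The function names and the optional per-input-channel saliency vector are hypothetical.

```python
import numpy as np


def regularized_gram(x, lam, row_saliency=None):
    """Gram matrix of calibration inputs with the penalty folded in.

    x: (n, d_in) calibration inputs. A squared-L2 penalty adds lam * I;
    an assumed per-input-channel saliency vector adds lam * diag(s) instead.
    """
    gram = x.T @ x
    d_in = gram.shape[0]
    s = np.ones(d_in) if row_saliency is None else np.asarray(row_saliency, dtype=float)
    return gram + lam * np.diag(s)


def quadratic_loss(delta, gram):
    """tr(delta^T gram delta) for a weight perturbation delta of shape (d_in, d_out)."""
    return float(np.trace(delta.T @ gram @ delta))


def direct_loss(x, delta, lam, row_saliency=None):
    """Reconstruction error plus the weighted squared deviation, computed directly."""
    d_in = delta.shape[0]
    s = np.ones(d_in) if row_saliency is None else np.asarray(row_saliency, dtype=float)
    return float(np.sum((x @ delta) ** 2) + lam * np.sum(s[:, None] * delta ** 2))
```

Here `delta` stands for the difference between quantized and original weights. The two loss functions agree up to floating-point error, so under these assumptions a solver that already consumes a Gram matrix would only need the modified matrix. Element-wise saliency does not reduce to a single shared matrix in the same way, so that case is left out of this sketch.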

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same deviation penalty might reduce sensitivity to the exact choice or size of the calibration dataset.
  • Because the regularizer acts only at calibration time, it could combine with other compression steps such as pruning or distillation.
  • If saliency weighting proves important, future work could test whether automatically learned saliency maps outperform the hand-crafted ones used here.

Load-bearing premise

That penalizing weight deviation from the original floating-point values on a small calibration set will reliably improve performance on unseen data without requiring per-model retuning or creating new instabilities.

What would settle it

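Run the same PTQ pipelines on the same LLMs and calibration data with and without the added regularization term, over several calibration samples or seeds, then check whether removing the term lowers zero-shot accuracy and raises perplexity by about the reported margin. If the difference stays within run-to-run variation, the claimed gain does not hold.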

Figures

Figures reproduced from arXiv: 2605.05693 by Baihua He, Harrison Bo Hua Zhu, Huihang Liu, Li Zeng, Wenlong Chen, Xiaoyuan Cheng, Xinyu Zhang, Yanlong Zhao, Zhuo Sun.

Figure 1: Illustration and validation of our motivation.
Figure 2: Ablation study. (a) Extension to OmniQuant: the y-axis shows downstream task accuracy, and the bars compare FP16/BF16, OmniQuant, and OmniQuant augmented with SARQC on the reported tasks for LLaMA2-7B and Mixtral-8x7B under W4A16. (b) Effect of calibration size: the x-axis shows the number of calibration samples, and the y-axis shows average downstream accuracy. The vertical dashed segments indicate the ac…
Figure 3: Effect of the regularization strength λ for SARQC-GS on LLaMA2-7B under INT4 weight-only quantization.
Figure 4: Visualization of the element-wise weight discrepancy.

Original abstract

Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, typically optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing PTQ calibration objectives based solely on empirical reconstruction error over limited or unrepresentative calibration data may move the quantized weights away from the original floating-point weights, potentially degrading downstream performance. To address this issue, we propose Regularized Quantization Calibration (RQC), a unified framework that augments standard PTQ objectives with a regularizer that explicitly controls weight deviation from the original weights. We further generalize this framework to incorporate a saliency-aware regularizer, resulting in Saliency-Aware Regularized Quantization Calibration (SARQC). The proposed regularization encourages quantized weights to remain close to the original weights during calibration, leading to improved generalization at inference time. SARQC integrates seamlessly into existing PTQ pipelines and enhances both scale-search-based and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without introducing additional inference overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Regularized Quantization Calibration (RQC) and its saliency-aware variant SARQC for post-training quantization of LLMs. Standard PTQ objectives (scale-search or Gram-based) that minimize layer-wise reconstruction error on limited calibration data are augmented with an explicit regularizer penalizing deviation of quantized weights from the original floating-point weights; the saliency-aware version weights this term by a saliency map. The central claim is that this prevents the quantized solution from drifting away from the floating-point solution, yielding improved generalization measured by lower perplexity and higher zero-shot accuracy on both dense and MoE models, with no added inference cost and seamless integration into existing pipelines.

Significance. If the empirical gains prove robust, the work supplies a lightweight, unified regularization prior that directly targets the generalization risk inherent in empirical-risk minimization over small calibration sets. Because the modification is confined to the calibration stage and adds no runtime overhead, it could be adopted as a drop-in enhancement for a wide range of existing PTQ algorithms.

major comments (2)
  1. [Abstract] Abstract: the claim of 'consistent improvements in perplexity and zero-shot accuracy' is asserted without any numerical values, tables, error bars, or description of calibration-set size/selection. This absence is load-bearing because the paper's contribution rests entirely on the magnitude and reliability of these gains; without them the reader cannot judge whether the regularizer delivers meaningful benefit or merely marginal noise.
  2. [Method (regularizer definition)] The formulation of the regularizer (both plain L2 and saliency-weighted) is derived from the same limited calibration data used for the reconstruction loss. Because the saliency map and Gram matrix are estimated on this data, the added quadratic term can amplify whatever distribution shift exists between calibration and test distributions rather than acting as a true generalization prior. The manuscript should therefore contain an explicit ablation (e.g., cross-calibration-set experiments or measurement of weight deviation on held-out data) showing that the regularizer does not reinforce calibration-set biases; no such analysis is referenced in the abstract or skeptic summary.
minor comments (2)
  1. [Abstract] The abstract states that SARQC 'enhances both scale-search-based and Gram-based methods under a unified formulation' but does not name the concrete baselines (e.g., GPTQ, AWQ, or specific Gram solvers). Adding these references would clarify the scope of the claimed improvement.
  2. Notation for the saliency map and the regularization strength hyper-parameter should be introduced once and used consistently; currently the abstract uses 'saliency-aware regularizer' without defining the symbol or its estimation procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our work. We address the major comments point by point below, indicating the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent improvements in perplexity and zero-shot accuracy' is asserted without any numerical values, tables, error bars, or description of calibration-set size/selection. This absence is load-bearing because the paper's contribution rests entirely on the magnitude and reliability of these gains; without them the reader cannot judge whether the regularizer delivers meaningful benefit or merely marginal noise.

    Authors: We agree with the referee that including specific numerical evidence in the abstract would strengthen the presentation and allow readers to better evaluate the practical impact of SARQC. In the revised manuscript, we will modify the abstract to incorporate key quantitative results from our experiments, such as the average reductions in perplexity and improvements in zero-shot accuracy across the evaluated models. We will also briefly note the calibration dataset details, including size and selection method, to provide necessary context. revision: yes

  2. Referee: [Method (regularizer definition)] The formulation of the regularizer (both plain L2 and saliency-weighted) is derived from the same limited calibration data used for the reconstruction loss. Because the saliency map and Gram matrix are estimated on this data, the added quadratic term can amplify whatever distribution shift exists between calibration and test distributions rather than acting as a true generalization prior. The manuscript should therefore contain an explicit ablation (e.g., cross-calibration-set experiments or measurement of weight deviation on held-out data) showing that the regularizer does not reinforce calibration-set biases; no such analysis is referenced in the abstract or skeptic summary.

    Authors: This is a valid concern about the potential for the regularizer to overfit to the calibration data. To address this, we will include an additional ablation study in the revised manuscript. Specifically, we will perform cross-calibration-set experiments in which the saliency map and regularizer are computed on one calibration dataset and the resulting quantized models are compared with models calibrated on a different dataset. Additionally, we will report measurements of weight deviation on held-out data to demonstrate that the regularizer promotes closeness to the original weights without amplifying biases. We believe these additions will clarify that the improvements stem from better generalization. revision: yes
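
The kind of held-out measurement discussed in this exchange could be sketched per layer as follows. This is only an illustration under assumed definitions: the function names are hypothetical, the error is a mean squared layer-output difference, and the weight deviation is a squared Frobenius distance, which by itself does not depend on any data.

```python
import numpy as np


def reconstruction_error(w, w_hat, x):
    """Mean squared layer-output error of w_hat relative to w on inputs x."""
    return float(np.mean((x @ w_hat - x @ w) ** 2))


def generalization_gap(w, w_hat, x_cal, x_held):
    """Held-out error minus calibration error for one quantized layer."""
    return reconstruction_error(w, w_hat, x_held) - reconstruction_error(w, w_hat, x_cal)


def weight_deviation(w, w_hat):
    """Squared Frobenius distance between quantized and original weights."""
    return float(np.sum((w_hat - w) ** 2))
```

Reporting `generalization_gap` alongside `weight_deviation` for calibrations run with and without the regularizer, using activations from a dataset not seen during calibration as `x_held`, would be one way to check whether smaller deviation goes together with a smaller gap.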

Circularity Check

0 steps flagged

No significant circularity in the SARQC derivation chain

full rationale

The paper augments standard PTQ reconstruction objectives (scale-search or Gram-based) with an explicit regularization term that penalizes deviation of quantized weights from the original floating-point weights, optionally weighted by a saliency map. This is presented as a direct, additive modification to the calibration loss rather than a quantity derived from or equivalent to the inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations to prior author work appear in the abstract or described framework. The method is self-contained as an empirical extension to existing pipelines, with performance claims resting on downstream experiments rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that limited calibration data causes existing PTQ methods to overfit and degrade generalization; the paper introduces a new regularizer term whose strength and saliency computation are not detailed in the abstract.

axioms (1)
  • domain assumption: Existing PTQ calibration based solely on empirical reconstruction error over limited data can move quantized weights away from the originals and degrade downstream performance.
    Explicitly stated in the abstract as the motivation for adding regularization.

pith-pipeline@v0.9.0 · 5555 in / 1306 out tokens · 63181 ms · 2026-05-11T00:52:30.904274+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
