arxiv: 2605.05693 · v2 · submitted 2026-05-07 · 💻 cs.AI · cs.LG

Saliency-Aware Regularized Quantization Calibration for Large Language Models

Yanlong Zhao, Xiaoyuan Cheng, Huihang Liu, Baihua He, Xinyu Zhang, Harrison Bo Hua Zhu, Wenlong Chen, Li Zeng, Zhuo Sun

Pith reviewed 2026-05-11 00:52 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords post-training quantization, large language models, quantization calibration, regularization, saliency, generalization

The pith

Adding a regularizer to keep quantized weights close to their original floating-point values during calibration reduces generalization error in post-training quantization of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard post-training quantization methods calibrate by minimizing reconstruction error on a small dataset, which can push quantized weights far from the original floating-point values and hurt performance on new data. The paper introduces Regularized Quantization Calibration (RQC) that adds a penalty term to the objective to control this deviation, then extends it to Saliency-Aware RQC (SARQC) that weights the penalty by how important each weight is. This change integrates into existing scale-search and Gram-based pipelines without changing inference speed. Experiments on dense and Mixture-of-Experts models show lower perplexity and higher zero-shot accuracy after the modification.

Core claim

The core claim is that existing calibration objectives based only on empirical reconstruction error over limited calibration data can increase the distance between quantized and original weights, raising generalization risk, and that explicitly adding a regularizer to minimize this distance (further modulated by saliency) produces quantized models with better downstream performance while remaining compatible with prior PTQ techniques.

What carries the argument

Regularized Quantization Calibration (RQC), a framework that augments the standard PTQ loss with an explicit term penalizing the deviation of quantized weights from the original floating-point weights, optionally weighted by saliency.
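As a minimal sketch of the kind of objective described here, the NumPy snippet below scores a candidate quantized weight matrix by its reconstruction error plus λ times the saliency-weighted deviation from the original weights, following the form of Eq. 8 as quoted in the Lean-theorem section of this page, and plugs that score into a simple clipping-ratio grid search. The round-to-nearest quantizer, the search grid, the identity saliency matrix, and every function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize(W, scale, bits=4):
    """Symmetric uniform round-to-nearest quantization with a per-row scale (assumed quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale

def rqc_objective(W, W_hat, X, S, lam):
    """Reconstruction error plus lam times the saliency-weighted weight deviation."""
    recon = np.linalg.norm(W @ X - W_hat @ X, "fro") ** 2
    drift = np.linalg.norm((W_hat - W) @ S, "fro") ** 2
    return recon + lam * drift

def search_scale(W, X, S, lam, bits=4, grid=20):
    """Try several clipping ratios and keep the candidate with the lowest objective."""
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(W).max(axis=1, keepdims=True) / qmax  # per-row max-abs scale
    best_loss, best_W_hat = np.inf, None
    for ratio in np.linspace(0.5, 1.0, grid):
        W_hat = quantize(W, base * ratio, bits)
        loss = rqc_objective(W, W_hat, X, S, lam)
        if loss < best_loss:
            best_loss, best_W_hat = loss, W_hat
    return best_loss, best_W_hat

# Toy layer: 8 output channels, 16 input channels, 32 calibration tokens (all hypothetical).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 32))
S = np.eye(16)  # identity saliency, i.e. the plain (unweighted) regularizer

loss_plain, W_plain = search_scale(W, X, S, lam=0.0)  # reconstruction error only
loss_reg, W_reg = search_scale(W, X, S, lam=1.0)      # with the deviation penalty
```

On a shared candidate grid, the matrix selected with λ > 0 is never farther from W in the S-weighted norm than the one selected with λ = 0, which is the drift-control effect the regularizer is meant to provide.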

If this is right

Both scale-search-based and Gram-based calibration methods gain from the same unified regularization term.
The approach applies to both dense and Mixture-of-Experts architectures without extra inference cost.
Perplexity and zero-shot task accuracy improve consistently across tested models.
The regularizer can be added to existing PTQ code with minimal changes.
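The shared structure behind the first point can be checked numerically. Assuming the objective of Eq. 8 and the matrix G of Eq. 10 as quoted in the Lean-theorem section of this page, the regularized loss is a quadratic form in the weight error Δ = Ŵ − W with Gram matrix G = XXᵀ + λSSᵀ, so a Gram-based solver would only need G in place of XXᵀ. The random data, the perturbation standing in for quantization error, and the diagonal saliency matrix below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 10
W = rng.normal(size=(d_out, d_in))
W_hat = W + 0.05 * rng.normal(size=(d_out, d_in))  # stand-in for quantized weights
X = rng.normal(size=(d_in, n))                     # calibration activations
S = np.diag(rng.uniform(0.5, 2.0, size=d_in))      # hypothetical per-channel saliency
lam = 0.3

delta = W_hat - W

# Direct evaluation: reconstruction error plus lam times the weighted deviation.
direct = (np.linalg.norm(delta @ X, "fro") ** 2
          + lam * np.linalg.norm(delta @ S, "fro") ** 2)

# Gram form: the same value as a single quadratic form in delta.
G = X @ X.T + lam * (S @ S.T)
gram_form = np.trace(delta @ G @ delta.T)

print(direct, gram_form)
```

Because SSᵀ is positive definite for a diagonal S with nonzero entries, adding λSSᵀ also keeps G well conditioned when the calibration set is small, much as a ridge term does.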

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same deviation penalty might reduce sensitivity to the exact choice or size of the calibration dataset.
Because the regularizer acts only at calibration time, it could combine with other compression steps such as pruning or distillation.
If saliency weighting proves important, future work could test whether automatically learned saliency maps outperform the hand-crafted ones used here.

Load-bearing premise

That penalizing weight deviation from the original floating-point values on a small calibration set will reliably improve performance on unseen data without requiring per-model retuning or creating new instabilities.

What would settle it

Run the same PTQ pipelines on the same LLMs and calibration data with and without the added regularization term, then check whether the gap in perplexity and zero-shot accuracy between the two runs matches the reported gain and exceeds run-to-run noise.

Figures

Figures reproduced from arXiv: 2605.05693 by Baihua He, Harrison Bo Hua Zhu, Huihang Liu, Li Zeng, Wenlong Chen, Xiaoyuan Cheng, Xinyu Zhang, Yanlong Zhao, Zhuo Sun.

**Figure 1.** Illustration and validation of our motivation.

**Figure 2.** Ablation study. (a) Extension to OmniQuant: the y-axis shows downstream task accuracy, and the bars compare FP16/BF16, OmniQuant, and OmniQuant augmented with SARQC on the reported tasks for LLaMA2-7B and Mixtral-8x7B under W4A16. (b) Effect of calibration size: the x-axis shows the number of calibration samples, and the y-axis shows average downstream accuracy. The vertical dashed segments indicate the ac…

**Figure 3.** Effect of the regularization strength λ for SARQC-GS on LLaMA2-7B under INT4 weight-only quantization.

**Figure 4.** Visualization of the element-wise weight discrepancy.

Original abstract

Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, typically optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing PTQ calibration objectives based solely on empirical reconstruction error over limited or unrepresentative calibration data may move the quantized weights away from the original floating-point weights, potentially degrading downstream performance. To address this issue, we propose Regularized Quantization Calibration (RQC), a unified framework that augments standard PTQ objectives with a regularizer that explicitly controls weight deviation from the original weights. We further generalize this framework to incorporate a saliency-aware regularizer, resulting in Saliency-Aware Regularized Quantization Calibration (SARQC). The proposed regularization encourages quantized weights to remain close to the original weights during calibration, leading to improved generalization at inference time. SARQC integrates seamlessly into existing PTQ pipelines and enhances both scale-search-based and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without introducing additional inference overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a weight-deviation regularizer to standard PTQ calibration and claims that it, together with a saliency-weighted variant, improves generalization on LLMs at no extra inference cost. The risk of bias from reusing the same small calibration set for every estimate still needs checking.

The core move here is to augment existing PTQ objectives with an explicit penalty on how far the quantized weights stray from the original floating-point ones. RQC does this in a unified way that covers both scale-search and Gram-based calibration, and SARQC weights the penalty by saliency. The abstract says this leads to better perplexity and zero-shot accuracy on dense and MoE models while adding no inference overhead.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Regularized Quantization Calibration (RQC) and its saliency-aware variant SARQC for post-training quantization of LLMs. Standard PTQ objectives (scale-search or Gram-based) that minimize layer-wise reconstruction error on limited calibration data are augmented with an explicit regularizer penalizing deviation of the quantized weights from the original floating-point weights; the saliency-aware version weights this term by a saliency map. The central claim is that this prevents the quantized solution from drifting away from the floating-point solution, yielding improved generalization measured by lower perplexity and higher zero-shot accuracy on both dense and MoE models, with no added inference cost and seamless integration into existing pipelines.

Significance. If the empirical gains prove robust, the work supplies a lightweight, unified regularization prior that directly targets the generalization risk inherent in empirical-risk minimization over small calibration sets. Because the modification is confined to the calibration stage and adds no runtime overhead, it could be adopted as a drop-in enhancement for a wide range of existing PTQ algorithms.

major comments (2)

[Abstract] Abstract: the claim of 'consistent improvements in perplexity and zero-shot accuracy' is asserted without any numerical values, tables, error bars, or description of calibration-set size/selection. This absence is load-bearing because the paper's contribution rests entirely on the magnitude and reliability of these gains; without them the reader cannot judge whether the regularizer delivers meaningful benefit or merely marginal noise.
[Method (regularizer definition)] The formulation of the regularizer (both plain L2 and saliency-weighted) is derived from the same limited calibration data used for the reconstruction loss. Because the saliency map and Gram matrix are estimated on this data, the added quadratic term can amplify whatever distribution shift exists between calibration and test distributions rather than acting as a true generalization prior. The manuscript should therefore contain an explicit ablation (e.g., cross-calibration-set experiments or measurement of weight deviation on held-out data) showing that the regularizer does not reinforce calibration-set biases; no such analysis is referenced in the abstract or skeptic summary.

minor comments (2)

[Abstract] The abstract states that SARQC 'enhances both scale-search-based and Gram-based methods under a unified formulation' but does not name the concrete baselines (e.g., GPTQ, AWQ, or specific Gram solvers). Adding these references would clarify the scope of the claimed improvement.
Notation for the saliency map and the regularization strength hyper-parameter should be introduced once and used consistently; currently the abstract uses 'saliency-aware regularizer' without defining the symbol or its estimation procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our work. We address the major comments point by point below, indicating the revisions we will make to the manuscript.

Referee: [Abstract] Abstract: the claim of 'consistent improvements in perplexity and zero-shot accuracy' is asserted without any numerical values, tables, error bars, or description of calibration-set size/selection. This absence is load-bearing because the paper's contribution rests entirely on the magnitude and reliability of these gains; without them the reader cannot judge whether the regularizer delivers meaningful benefit or merely marginal noise.

Authors: We agree with the referee that including specific numerical evidence in the abstract would strengthen the presentation and allow readers to better evaluate the practical impact of SARQC. In the revised manuscript, we will modify the abstract to incorporate key quantitative results from our experiments, such as the average reductions in perplexity and improvements in zero-shot accuracy across the evaluated models. We will also briefly note the calibration dataset details, including size and selection method, to provide necessary context.

revision: yes

Referee: [Method (regularizer definition)] The formulation of the regularizer (both plain L2 and saliency-weighted) is derived from the same limited calibration data used for the reconstruction loss. Because the saliency map and Gram matrix are estimated on this data, the added quadratic term can amplify whatever distribution shift exists between calibration and test distributions rather than acting as a true generalization prior. The manuscript should therefore contain an explicit ablation (e.g., cross-calibration-set experiments or measurement of weight deviation on held-out data) showing that the regularizer does not reinforce calibration-set biases; no such analysis is referenced in the abstract or skeptic summary.

Authors: This is a valid concern about the potential for the regularizer to overfit to the calibration data. To address this, we will include an additional ablation study in the revised manuscript. Specifically, we will perform cross-calibration-set experiments where the saliency map and regularizer are computed using one calibration dataset and the quantization is evaluated on models using a different calibration set. Additionally, we will report measurements of weight deviation on held-out data to demonstrate that the regularizer promotes closeness to the original weights without amplifying biases. We believe these additions will clarify that the improvements stem from better generalization.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in the SARQC derivation chain

The paper augments standard PTQ reconstruction objectives (scale-search or Gram-based) with an explicit regularization term that penalizes deviation of the quantized weights from the original floating-point weights, optionally weighted by a saliency map. This is presented as a direct, additive modification to the calibration loss rather than a quantity derived from or equivalent to the inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations to prior author work appear in the abstract or described framework. The method is self-contained as an empirical extension to existing pipelines, with performance claims resting on downstream experiments rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that limited calibration data causes existing PTQ methods to overfit and degrade generalization; the paper introduces a new regularizer term whose strength and saliency computation are not detailed in the abstract.

axioms (1)

domain assumption: Existing PTQ calibration based solely on empirical reconstruction error over limited data can move quantized weights away from the originals and degrade downstream performance.
Explicitly stated in the abstract as the motivation for adding regularization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J is the unique calibrated reciprocal cost) · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$\min_{\widehat{W}_l \in Q} \|W_l X_l - \widehat{W}_l X_l\|_F^2 + \lambda \|(\widehat{W}_l - W_l) S_l\|_F^2$ (Eq. 8); $G_l := X_l X_l^\top + \lambda S_l S_l^\top$ (Eq. 10)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Theorem 3.1: generalization bound $R(\widehat{W}_l) - \widehat{R}_{\mathrm{cal}}(\widehat{W}_l) \le R^2 M_X^2 \sqrt{\log(2|Q_R|/\delta) / (2n)}$, controlled by the drift radius $R$
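To get a feel for how the right-hand side of the Theorem 3.1 bound quoted above scales, the short sketch below evaluates it for hypothetical values. It assumes the flattened formula reads R² · M_X² · sqrt(log(2|Q_R|/δ) / (2n)), with R the drift radius, M_X a bound on the activation norm, |Q_R| the number of quantized candidates within radius R, δ the failure probability, and n the number of calibration samples. That parenthesization, the interpretation of the symbols, and all numeric inputs are assumptions, not values from the paper.

```python
import math

def generalization_gap_bound(R, M_X, num_candidates, delta, n):
    """Assumed reading of the bound: R^2 * M_X^2 * sqrt(log(2|Q_R|/delta) / (2n))."""
    return (R ** 2) * (M_X ** 2) * math.sqrt(math.log(2 * num_candidates / delta) / (2 * n))

# Hypothetical inputs: 16 candidate quantized matrices, 95% confidence.
# In practice |Q_R| itself would grow with R; it is held fixed here for simplicity.
for n in (128, 512):
    for R in (0.5, 1.0, 2.0):
        bound = generalization_gap_bound(R, M_X=1.0, num_candidates=16, delta=0.05, n=n)
        print(f"n={n:4d}  R={R:.1f}  bound={bound:.4f}")
```

Under this reading the bound shrinks like 1/√n as calibration data grows and grows quadratically in R, which is why keeping the quantized weights within a small radius of the originals tightens it.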

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 1 internal anchor

[1]

Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. InAdvances in Neural Information Processing Systems, volume 37, pages 100213–100240. Curran Associates, Inc., 2024

work page 2024
[2]

Post training 4-bit quantization of convolu- tional networks for rapid-deployment

Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolu- tional networks for rapid-deployment. InAdvances in Neural Information Processing Systems, volume 32, pages 7950–7958. Curran Associates, Inc., 2019

work page 2019
[3]

Cambridge University Press, Cambridge, 2004

Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, Cambridge, 2004

work page 2004
[4]

Mikhail A. Bragin. Survey on Lagrangian relaxation for MILP: Importance, challenges, historical review, recent advancements, and opportunities.Annals of Operations Research, 333(1):29–45, 2024

work page 2024
[5]

Regularized calibration with successive rounding for post-training quantization.arXiv preprint arXiv:2602.05902, 2026

Seohyeon Cha, Huancheng Chen, Dongjun Kim, Haoran Zhang, Kevin Chan, Gustavo de Ve- ciana, and Haris Vikalo. Regularized calibration with successive rounding for post-training quantization.arXiv preprint arXiv:2602.05902, 2026

work page internal anchor Pith review arXiv 2026
[6]

QuIP: 2-bit quantization of large language models with guarantees

Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. InAdvances in Neural Information Processing Systems, volume 36, pages 4396–4429. Curran Associates, Inc., 2023

work page 2023
[7]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. Evaluating large language models trained on code, 2021

work page 2021
[8]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...

work page 2019
[9]

Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018

work page 2018
[10]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models, 2024

work page 2024
[11]

Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh

Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. InThe Twelfth International Confer- ence on Learning Representations, 2024

work page 2024
[12]

Di Pillo and L

G. Di Pillo and L. Grippo. Exact penalty functions in constrained optimization.SIAM Journal on Control and Optimization, 27(6):1333–1360, 1989. 10

work page 1989
[13]

Nonsmooth exact penalty methods for equality-constrained optimization: Complexity and implementation.SIAM Journal on Optimization, 36(2):626–650, 2026

Youssef Diouane, Maxence Gollier, and Dominique Orban. Nonsmooth exact penalty methods for equality-constrained optimization: Complexity and implementation.SIAM Journal on Optimization, 36(2):626–650, 2026

work page 2026
[14]

OAC: output-adaptive calibration for accurate post-training quantization

Ali Edalati, Alireza Ghaffari, Mahsa Ghazvini Nejad, Lu Hou, Boxing Chen, Masoud Asgharian, and Vahid Partovi Nia. OAC: output-adaptive calibration for accurate post-training quantization. InProceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fif...

work page 2025
[15]

Springer, Berlin, Heidelberg, 2 edition, 2005

Matthias Ehrgott.Multicriteria Optimization. Springer, Berlin, Heidelberg, 2 edition, 2005

work page 2005
[16]

Generalized lagrange multiplier method for solving problems of optimum allocation of resources.Operations Research, 11(3):399–417, 1963

Hugh Everett, III. Generalized lagrange multiplier method for solving problems of optimum allocation of resources.Operations Research, 11(3):399–417, 1963

work page 1963
[17]

Marshall L. Fisher. The Lagrangian relaxation method for solving integer programming problems.Management Science, 27(1):1–18, 1981

work page 1981
[18]

GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023

work page 2023
[19]

Optimal brain compression: A framework for accurate post-training quantization and pruning

Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. InAdvances in Neural Information Processing Systems, volume 35, pages 4475–4488, 2022

work page 2022
[20]

Geoffrion

Arthur M. Geoffrion. Proper efficiency and the theory of vector maximization.Journal of Mathematical Analysis and Applications, 22(3):618–630, 1968

work page 1968
[21]

Geoffrion

Arthur M. Geoffrion. Lagrangean relaxation for integer programming. In M. L. Balinski, editor, Approaches to Integer Programming, volume 2 ofMathematical Programming Studies, pages 82–114. Springer, Berlin, Heidelberg, 1974

work page 1974
[22]

Rethinking post-training quantization: Introducing a statistical pre-calibration approach

Alireza Ghaffari, Sharareh Younesian, Boxing Chen, Vahid Partovi Nia, and Masoud Asgharian. Rethinking post-training quantization: Introducing a statistical pre-calibration approach. In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods, pages 159–169. SciTePress, 2025

work page 2025
[23]

Mahoney, and Kurt Keutzer

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference, 2021

work page 2021
[24]

The Llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 herd of models, 2024

work page 2024
[25]

Han and O

S.-P. Han and O. L. Mangasarian. Exact penalty functions in nonlinear programming.Mathe- matical Programming, 17:251–269, 1979

work page 1979
[26]

Stork, and Gregory Wolff

Babak Hassibi, David G. Stork, and Gregory Wolff. Optimal brain surgeon: Extensions and performance comparisons. InAdvances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993

work page 1993
[27]

Using scalarizations for the approximation of multiobjective optimization problems: Towards a general theory

Stephan Helfrich, Arne Herzel, Stefan Ruzika, and Clemens Thielen. Using scalarizations for the approximation of multiobjective optimization problems: Towards a general theory. Mathematical Methods of Operations Research, 100:27–63, 2024

work page 2024
[28]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[29]

Hoerl and Robert W

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970

work page 1970
[30]

Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, et al

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, et al. Mistral 7B, 2023. 11

work page 2023
[31]

Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, et al

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, et al. Mixtral of experts, 2024

work page 2024
[32]

Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens J. S. Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, and Hyun Oh Song. GuidedQuant: Large language model quantization via exploiting end loss guidance. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 30011–30037. PMLR, 2025

work page 2025
[33]

On supportedness in multi-objective integer linear programming.Journal of Multi-Criteria Decision Analysis, 32(3):e70024, 2025

David Könen and Michael Stiglmayr. On supportedness in multi-objective integer linear programming.Journal of Multi-Criteria Decision Analysis, 32(3):e70024, 2025

work page 2025
[34]

Denker, and Sara A

Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. InAdvances in Neural Information Processing Systems, volume 2, pages 598–605, 1989

work page 1989
[35]

OWQ: Outlier- aware weight quantization for efficient fine-tuning and inference of large language models

Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. OWQ: Outlier- aware weight quantization for efficient fine-tuning and inference of large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12):13355–13364, 2024

work page 2024
[36]

BRECQ: Pushing the limit of post-training quantization by block reconstruction

Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. InInternational Conference on Learning Representations, 2021

work page 2021
[37]

GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. InProceedings of the 42nd In- ternational Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 36690–36706. PMLR, 2025

work page 2025
[38]

ParoQuant: Pairwise rotation quantization for efficient reasoning LLM inference

Yesheng Liang, Haisheng Chen, Song Han, and Zhijian Liu. ParoQuant: Pairwise rotation quantization for efficient reasoning LLM inference. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[39]

Haitao Liao, Xujing Yuan, and Ruxin Gao. An exact penalty function optimization method and its application in stress constrained topology optimization and scenario based reliability design problems.Applied Mathematical Modelling, 125:260–292, 2024

work page 2024
[40]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. InProceedings of Machine Learning and Systems, volume 6, pages 87–100, 2024

work page 2024
[41]

PD- Quant: Post-training quantization based on prediction difference metric

Jiawei Liu, Lin Niu, Zhihang Yuan, Dawei Yang, Xinggang Wang, and Wenyu Liu. PD- Quant: Post-training quantization based on prediction difference metric. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24427–24437, 2023

work page 2023
[42]

LLM-QAT: Data-free quan- tization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quan- tization aware training for large language models. InFindings of the Association for Compu- tational Linguistics: ACL 2024, pages 467–484, Bangkok, Thailand, 2024. Association for Computat...

work page 2024
[43]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[44]

Exact penalty functions for nonlinear integer program- ming problems.Journal of Optimization Theory and Applications, 145(3):479–488, 2010

Stefano Lucidi and Francesco Rinaldi. Exact penalty functions for nonlinear integer program- ming problems.Journal of Optimization Theory and Applications, 145(3):479–488, 2010

work page 2010
[45]

Timothy Marler and Jasbir S

R. Timothy Marler and Jasbir S. Arora. The weighted sum method for multi-objective optimiza- tion: New insights.Structural and Multidisciplinary Optimization, 41(6):853–862, 2010. 12

work page 2010
[46]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. InInternational Conference on Learning Representations, 2017

work page 2017
[47]

Springer, New York, 1998

Kaisa Miettinen.Nonlinear Multiobjective Optimization, volume 12 ofInternational Series in Operations Research & Management Science. Springer, New York, 1998

work page 1998
[48]

Up or down? Adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. InProceedings of the 37th In- ternational Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 7197–7206. PMLR, 2020

work page 2020
[49]

A white paper on neural network quantization, 2021

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization, 2021

work page 2021
[50]

Wright.Numerical Optimization

Jorge Nocedal and Stephen J. Wright.Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, 2 edition, 2006

work page 2006
[51]

GPT-4 technical report, 2023

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, et al. GPT-4 technical report, 2023

work page 2023
[52]

WinoGrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

work page 2021
[53]

Enhancing post-training quantization calibration through contrastive learning

Yuzhang Shang, Gaowen Liu, Ramana Rao Kompella, and Yan Yan. Enhancing post-training quantization calibration through contrastive learning. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 15921–15930, 2024

[54]

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations, 2024

[55]

Ye Tian, Chengcheng Wang, Jing Han, Yehui Tang, and Kai Han. PocketLLM: Ultimate compression of large language models via meta networks, 2025

[56]

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996

[57]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. Llama 2: Open foundation and fine-tuned chat models, 2023

[58]

Mart Van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Artem Bolshakov, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. GPTVQ: The blessing of dimensionality for LLM quantization. In Workshop on Efficient Systems for Foundation Models II @ ICML 2024, 2024

[59]

Mingrun Wei, Yeyu Yan, and Dong Wang. MPPQ: Enhancing post-training quantization for LLMs via mixed supervision, proxy rounding, and pre-searching. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 8277–8285, 2025

[60]

Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. In Advances in Neural Information Processing Systems, volume 35, pages 17402–17414. Curran Associates, Inc., 2022

[61]

Di Wu, Qi Tang, Yongle Zhao, Ming Zhang, Ying Fu, and Debing Zhang. Easyquant: Post-training quantization via scale optimization, 2020

[62]

Haocheng Xi, Changhao Li, Jianfei Chen, and Jun Zhu. Training transformers with 4-bit integers. In Advances in Neural Information Processing Systems, volume 36, pages 49146–49168. Curran Associates, Inc., 2023

[63]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR, 2023

[64]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report, 2025

[65]

Lotfi A. Zadeh. Optimality and non-scalar-valued performance criteria. IEEE Transactions on Automatic Control, 8(1):59–60, 1963

[66]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics.

Appendix

Appendix Overview

This appendix offers...

[67]

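Then for any fixed $\widehat{W}_l \in \mathcal{Q}_R$, $\mathcal{R}(\widehat{W}_l) = \mathbb{E}\big[\ell_{\widehat{W}_l}(X)\big]$ and $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{W}_l) = \frac{1}{n}\sum_{i=1}^{n} \ell_{\widehat{W}_l}(X_{l,i})$. Since $0 \le \ell_{\widehat{W}_l}(X) \le R^2 M_X^2$, Hoeffding's inequality gives, for any fixed $\widehat{W}_l \in \mathcal{Q}_R$,

$$\Pr\Big( \big| \mathcal{R}(\widehat{W}_l) - \widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{W}_l) \big| \ge t \Big) \le 2 \exp\left( -\frac{2 n t^2}{R^4 M_X^4} \right).$$

Since $\mathcal{Q}_R$ is finite, we apply the union bound:

$$\Pr\Big( \exists\, \widehat{W}_l \in \mathcal{Q}_R : \big| \mathcal{R}(\widehat{W}_l) - \widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{W}_l) \big| \ge t \Big) \le 2\, |\mathcal{Q}_R| \exp\left( -\frac{2 n t^2}{R^4 M_X^4} \right).$$

Setting the right-hand...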
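To give a feel for the scale of the union bound above, the following minimal Python sketch evaluates it numerically and solves it for the deviation t at a chosen failure probability. It is an illustration only and not code from the paper. The function names, the argument `loss_max` (standing in for the loss bound R² M_X²), the argument `log_card` (the natural logarithm of the size of the finite candidate set, passed in log form because that size can be astronomically large), and all numeric values are assumptions made for this example.

```python
import math


def union_bound_tail(n, t, loss_max, log_card):
    """Evaluate 2 * |Q| * exp(-2 n t^2 / loss_max^2), capped at 1.

    n         -- number of calibration samples
    t         -- deviation between true and calibration risk
    loss_max  -- upper bound on the per-sample loss (hypothetical stand-in for R^2 M_X^2)
    log_card  -- natural log of the number of candidates |Q|
    """
    log_bound = math.log(2.0) + log_card - 2.0 * n * t ** 2 / loss_max ** 2
    # Work in log space and cap at 1 so that huge candidate sets do not overflow.
    return 1.0 if log_bound >= 0.0 else math.exp(log_bound)


def deviation_at_confidence(n, delta, loss_max, log_card):
    """Solve 2 * |Q| * exp(-2 n t^2 / loss_max^2) = delta for t."""
    return loss_max * math.sqrt(
        (math.log(2.0) + log_card - math.log(delta)) / (2.0 * n)
    )


if __name__ == "__main__":
    # Hypothetical setting: 1000 weights on a 16-level grid, 128 calibration samples.
    log_card = 1000 * math.log(16)
    t = deviation_at_confidence(n=128, delta=0.05, loss_max=4.0, log_card=log_card)
    print("deviation t:", t)
    print("bound at t:", union_bound_tail(128, t, 4.0, log_card))
```

In this hypothetical setting the deviation that the bound can guarantee exceeds `loss_max` itself, so the guarantee is vacuous. This shows how quickly the log-cardinality term dominates when the calibration set is small relative to the number of quantized weights.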
