Saliency-Aware Regularized Quantization Calibration for Large Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-11 00:52 UTC · model grok-4.3
The pith
Adding a regularizer to keep quantized weights close to their original floating-point values during calibration reduces generalization error in post-training quantization of large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core claim is twofold: existing calibration objectives based only on empirical reconstruction error over limited calibration data can push quantized weights away from the original floating-point weights, raising generalization risk; and explicitly adding a regularizer that minimizes this distance, further modulated by saliency, produces quantized models with better downstream performance while remaining compatible with prior PTQ techniques.
What carries the argument
Regularized Quantization Calibration (RQC), a framework that augments the standard PTQ loss with an explicit term penalizing the deviation of quantized weights from the original floating-point weights, optionally weighted by saliency.
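The augmented objective fits in a few lines. A minimal numpy sketch, assuming a per-layer weight matrix W, its quantized counterpart W_q, calibration activations X, a saliency matrix S, and a strength lam — all names illustrative, not the paper's code:

```python
import numpy as np

def rqc_loss(W, W_q, X, S, lam):
    # Standard PTQ reconstruction error on the calibration activations...
    recon = np.linalg.norm(W @ X - W_q @ X, "fro") ** 2
    # ...plus the saliency-weighted deviation penalty RQC/SARQC adds,
    # pulling the quantized weights back toward the floating-point ones.
    drift = np.linalg.norm((W_q - W) @ S, "fro") ** 2
    return recon + lam * drift
```

With lam = 0 this reduces to the standard calibration objective, which is what makes the term a drop-in addition; with S the identity it is the plain (non-saliency) RQC variant.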
If this is right
- Both scale-search-based and Gram-based calibration methods gain from the same unified regularization term.
- The approach applies to both dense and Mixture-of-Experts architectures without extra inference cost.
- Perplexity and zero-shot task accuracy improve consistently across tested models.
- The regularizer can be added to existing PTQ code with minimal changes.
Where Pith is reading between the lines
- The same deviation penalty might reduce sensitivity to the exact choice or size of the calibration dataset.
- Because the regularizer acts only at calibration time, it could combine with other compression steps such as pruning or distillation.
- If saliency weighting proves important, future work could test whether automatically learned saliency maps outperform the hand-crafted ones used here.
Load-bearing premise
That penalizing weight deviation from the original floating-point values on a small calibration set will reliably improve performance on unseen data without requiring per-model retuning or creating new instabilities.
What would settle it
Run the same PTQ pipelines on the same LLMs and calibration data with and without the added regularization term, then measure whether zero-shot accuracy drops by more than the reported gain when the term is removed.
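That ablation can be sketched in a few lines, assuming a quantize(model, calib, lam) routine and a zero-shot evaluation harness exist — both hypothetical names, not the paper's code:

```python
def ablate_regularizer(quantize, evaluate, model, calib, lam=0.1):
    """Quantize the same model on the same calibration data twice, with and
    without the deviation penalty, and return the zero-shot accuracy gap
    attributable to the regularizer alone."""
    acc_with = evaluate(quantize(model, calib, lam=lam))
    acc_without = evaluate(quantize(model, calib, lam=0.0))
    return acc_with - acc_without
```

If the gap is smaller than the paper's reported improvement, the regularizer is not the load-bearing ingredient.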
Original abstract
Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, typically optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing PTQ calibration objectives based solely on empirical reconstruction error over limited or unrepresentative calibration data may move the quantized weights away from the original floating-point weights, potentially degrading downstream performance. To address this issue, we propose Regularized Quantization Calibration (RQC), a unified framework that augments standard PTQ objectives with a regularizer that explicitly controls weight deviation from the original weights. We further generalize this framework to incorporate a saliency-aware regularizer, resulting in Saliency-Aware Regularized Quantization Calibration (SARQC). The proposed regularization encourages quantized weights to remain close to the original weights during calibration, leading to improved generalization at inference time. SARQC integrates seamlessly into existing PTQ pipelines and enhances both scale-search-based and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without introducing additional inference overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Regularized Quantization Calibration (RQC) and its saliency-aware variant SARQC for post-training quantization of LLMs. Standard PTQ objectives (scale-search or Gram-based) that minimize layer-wise reconstruction error on limited calibration data are augmented with an explicit regularizer penalizing deviation of quantized weights from the original FP32 weights; the saliency-aware version weights this term by a saliency map. The central claim is that this prevents the quantized solution from drifting away from the FP32 solution, yielding improved generalization measured by lower perplexity and higher zero-shot accuracy on both dense and MoE models, with no added inference cost and seamless integration into existing pipelines.
Significance. If the empirical gains prove robust, the work supplies a lightweight, unified regularization prior that directly targets the generalization risk inherent in empirical-risk minimization over small calibration sets. Because the modification is confined to the calibration stage and adds no runtime overhead, it could be adopted as a drop-in enhancement for a wide range of existing PTQ algorithms.
major comments (2)
- [Abstract] The claim of 'consistent improvements in perplexity and zero-shot accuracy' is asserted without numerical values, tables, error bars, or any description of calibration-set size and selection. This absence is load-bearing because the paper's contribution rests entirely on the magnitude and reliability of these gains; without them the reader cannot judge whether the regularizer delivers a meaningful benefit or merely marginal noise.
- [Method (regularizer definition)] The formulation of the regularizer (both plain L2 and saliency-weighted) is derived from the same limited calibration data used for the reconstruction loss. Because the saliency map and Gram matrix are estimated on this data, the added quadratic term can amplify whatever distribution shift exists between calibration and test distributions rather than acting as a true generalization prior. The manuscript should therefore contain an explicit ablation (e.g., cross-calibration-set experiments or measurement of weight deviation on held-out data) showing that the regularizer does not reinforce calibration-set biases; no such analysis is referenced in the abstract or skeptic summary.
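The worry can be made concrete with a small synthetic sketch (illustrative only, not the paper's protocol): the effective Gram matrix G = XXᵀ + λSSᵀ that a Gram-based solver would factorize is itself an estimate from the calibration batch, so two different batches yield different G.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 8, 256, 0.1

def effective_gram(X, S, lam):
    # G = X Xᵀ + λ S Sᵀ, the regularized Gram a solver would factorize
    return X @ X.T + lam * (S @ S.T)

# two disjoint calibration batches drawn from the same distribution
X1 = rng.normal(size=(d, n))
X2 = rng.normal(size=(d, n))
S = np.diag(rng.uniform(0.5, 1.5, size=d))  # hypothetical saliency weights

G1 = effective_gram(X1, S, lam)
G2 = effective_gram(X2, S, lam)
rel_gap = np.linalg.norm(G1 - G2, "fro") / np.linalg.norm(G1, "fro")
# rel_gap > 0: the regularized objective inherits calibration-set noise,
# which is exactly what a cross-calibration ablation would have to probe
```

A cross-calibration experiment would swap the batch used to build S and G against the batch used for evaluation and check that the gains survive.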
minor comments (2)
- [Abstract] The abstract states that SARQC 'enhances both scale-search-based and Gram-based methods under a unified formulation' but does not name the concrete baselines (e.g., GPTQ, AWQ, or specific Gram solvers). Adding these references would clarify the scope of the claimed improvement.
- Notation for the saliency map and the regularization strength hyper-parameter should be introduced once and used consistently; currently the abstract uses 'saliency-aware regularizer' without defining the symbol or its estimation procedure.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our work. We address the major comments point by point below, indicating the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Abstract] The claim of 'consistent improvements in perplexity and zero-shot accuracy' is asserted without numerical values, tables, error bars, or any description of calibration-set size and selection. This absence is load-bearing because the paper's contribution rests entirely on the magnitude and reliability of these gains; without them the reader cannot judge whether the regularizer delivers a meaningful benefit or merely marginal noise.
Authors: We agree with the referee that including specific numerical evidence in the abstract would strengthen the presentation and allow readers to better evaluate the practical impact of SARQC. In the revised manuscript, we will modify the abstract to incorporate key quantitative results from our experiments, such as the average reductions in perplexity and improvements in zero-shot accuracy across the evaluated models. We will also briefly note the calibration dataset details, including size and selection method, to provide necessary context. revision: yes
-
Referee: [Method (regularizer definition)] The formulation of the regularizer (both plain L2 and saliency-weighted) is derived from the same limited calibration data used for the reconstruction loss. Because the saliency map and Gram matrix are estimated on this data, the added quadratic term can amplify whatever distribution shift exists between calibration and test distributions rather than acting as a true generalization prior. The manuscript should therefore contain an explicit ablation (e.g., cross-calibration-set experiments or measurement of weight deviation on held-out data) showing that the regularizer does not reinforce calibration-set biases; no such analysis is referenced in the abstract or skeptic summary.
Authors: This is a valid concern about the potential for the regularizer to overfit to the calibration data. To address this, we will include an additional ablation study in the revised manuscript. Specifically, we will perform cross-calibration-set experiments where the saliency map and regularizer are computed using one calibration dataset and the quantization is evaluated on models using a different calibration set. Additionally, we will report measurements of weight deviation on held-out data to demonstrate that the regularizer promotes closeness to the original weights without amplifying biases. We believe these additions will clarify that the improvements stem from better generalization. revision: yes
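A sketch of the held-out measurement the referee asks for, assuming access to the layer's original and quantized weights plus calibration and held-out activations (all names hypothetical):

```python
import numpy as np

def heldout_deviation_report(W, W_q, X_calib, X_heldout):
    """Weight drift plus reconstruction error on calibration vs. held-out
    activations; a held-out error much larger than the calibration error
    would suggest the calibrated solution reinforces calibration-set biases."""
    drift = np.linalg.norm(W_q - W, "fro")
    err_cal = np.linalg.norm((W - W_q) @ X_calib, "fro")
    err_out = np.linalg.norm((W - W_q) @ X_heldout, "fro")
    return drift, err_cal, err_out
```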
Circularity Check
No significant circularity in the SARQC derivation chain
Full rationale
The paper augments standard PTQ reconstruction objectives (scale-search or Gram-based) with an explicit regularization term that penalizes deviation of quantized weights from the original FP32 weights, optionally weighted by a saliency map. This is presented as a direct, additive modification to the calibration loss rather than a quantity derived from or equivalent to the inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations to prior author work appear in the abstract or described framework. The method is self-contained as an empirical extension to existing pipelines, with performance claims resting on downstream experiments rather than tautological reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing PTQ calibration based solely on empirical reconstruction error over limited data moves quantized weights away from the originals and degrades downstream performance.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J is the unique calibrated reciprocal cost) · tag: echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Linked passage: min_{Ŵₗ ∈ Q} ‖WₗXₗ − ŴₗXₗ‖²_F + λ‖(Ŵₗ − Wₗ)Sₗ‖²_F (Eq. 8); Gₗ := XₗXₗᵀ + λ SₗSₗᵀ (Eq. 10)
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Linked passage: Theorem 3.1 (generalization bound): R(Ŵₗ) − R̂_cal(Ŵₗ) ≤ R²M_X² √(log(2|Q_R|/δ) / (2n)), controlled by the drift radius R.
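The Gram matrix in Eq. 10 can be read off as the Hessian of the objective in Eq. 8; a short reconstruction, using the notation quoted above and not checked against the paper's appendix:

```latex
% Differentiating the quoted objective (Eq. 8) with respect to \widehat{W}_l:
\nabla_{\widehat{W}_l}\Big[\,\|W_l X_l - \widehat{W}_l X_l\|_F^2
  + \lambda\,\|(\widehat{W}_l - W_l)\,S_l\|_F^2\,\Big]
  = 2\,(\widehat{W}_l - W_l)\,\big(X_l X_l^\top + \lambda\,S_l S_l^\top\big)
  = 2\,(\widehat{W}_l - W_l)\,G_l ,
% so G_l := X_l X_l^\top + \lambda S_l S_l^\top (Eq. 10) plays the role of
% the curvature matrix that Gram-based solvers factorize.
```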
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Proof sketch for Theorem 3.1 (reconstructed from the appendix): for any fixed Ŵₗ ∈ Q_R, R(Ŵₗ) = E[ℓ_{Ŵₗ}(X)] and R̂_cal(Ŵₗ) = (1/n) Σᵢ ℓ_{Ŵₗ}(X_{l,i}). Since 0 ≤ ℓ_{Ŵₗ}(X) ≤ R²M_X², Hoeffding's inequality gives, for any fixed Ŵₗ ∈ Q_R, Pr[R(Ŵₗ) − R̂_cal(Ŵₗ) ≥ t] ≤ 2 exp(−2nt² / R⁴M_X⁴). Since Q_R is finite, the union bound yields Pr[∃Ŵₗ ∈ Q_R : R(Ŵₗ) − R̂_cal(Ŵₗ) ≥ t] ≤ 2|Q_R| exp(−2nt² / R⁴M_X⁴); setting the right-hand side to δ and solving for t gives the stated bound.
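The Theorem 3.1 bound can be evaluated numerically to see how it scales; the values of R, M_X, |Q_R|, δ, and n below are illustrative assumptions, not numbers from the paper:

```python
import math

def drift_bound(R, M_X, Q_size, delta, n):
    # Reconstructed form of the Theorem 3.1 gap:
    # R(W_hat) - R_cal(W_hat) <= R^2 * M_X^2 * sqrt(log(2|Q_R|/delta) / (2n))
    return R**2 * M_X**2 * math.sqrt(math.log(2 * Q_size / delta) / (2 * n))

# Halving the drift radius R tightens the bound fourfold (quadratic in R),
# while quadrupling the calibration set n only halves it (1/sqrt(n) decay);
# this is the formal case for controlling drift rather than adding data.
b_small_R = drift_bound(R=0.5, M_X=1.0, Q_size=10**6, delta=0.05, n=128)
b_large_R = drift_bound(R=1.0, M_X=1.0, Q_size=10**6, delta=0.05, n=128)
```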