Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
Pith reviewed 2026-05-08 04:31 UTC · model grok-4.3
The pith
Treating PTQ calibration selection as weighted set cover over outlier channels improves quantization accuracy with small sample budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Calibration quality in PTQ is governed more by weighted outlier-channel coverage than by generic sample representativeness. The paper formulates calibration selection as a weighted set cover problem over outlier channels, whose objective is monotone submodular. The greedy COVERCAL algorithm maximizes this objective on pre-computed activation statistics and requires no GPU time at selection. Under a stylized clipping model, missed weighted coverage upper-bounds the surrogate loss, making the objective principled rather than purely empirical. This yields better MMLU scores and lower perplexity than random, max-perplexity, max-activation-variance, and stratified baselines, especially at small budgets.
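As a reading aid, the selection objective and the claimed bound can be written out explicitly; the notation below is this review's rendering under stated assumptions, not necessarily the paper's.

```latex
% Coverage objective; C(S) is the set of outlier channels covered by the
% calibration set S and w_c >= 0 is the channel weight (notation assumed).
\[
  F(S) \;=\; \sum_{c \in \mathcal{C}(S)} w_c ,
  \qquad
  \mathcal{C}(S) \;=\; \bigcup_{i \in S} \mathcal{C}(\{i\}).
\]
% F is monotone and submodular, so greedy selection under a budget |S| <= k
% attains at least a (1 - 1/e) fraction of the optimal covered weight.
% The internal-consistency claim is that, under the stylized clipping model,
\[
  \mathcal{L}_{\mathrm{surr}}(S) \;\le\; \sum_{c \notin \mathcal{C}(S)} w_c ,
\]
% i.e., the missed weighted coverage upper-bounds the surrogate loss.
```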
What carries the argument
A weighted set cover over outlier channels: each calibration sample covers a weighted subset of outlier channels identified from activation statistics, and the monotone submodular coverage objective is maximized by the greedy algorithm.
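As a concrete illustration of this machinery, here is a minimal sketch of greedy weighted coverage selection. The data layout (`coverage` as per-sample sets of outlier-channel ids, `weights` as a channel-to-importance map) and the function name are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch of greedy weighted coverage for calibration selection.
# coverage[i]: set of outlier-channel ids that candidate sample i activates.
# weights[c]:  importance weight of outlier channel c.
def greedy_covercal(coverage, weights, budget):
    selected, covered = [], set()
    for _ in range(budget):
        best_i, best_gain = None, 0.0
        for i, channels in enumerate(coverage):
            if i in selected:
                continue
            # Marginal gain: total weight of channels this sample would newly cover.
            gain = sum(weights[c] for c in channels - covered)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no remaining sample adds coverage; stop early
            break
        selected.append(best_i)
        covered |= coverage[best_i]
    return selected, covered
```

Because the coverage objective is monotone submodular, this plain greedy loop already carries the classical (1 - 1/e) approximation guarantee under a cardinality budget.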
If this is right
- At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration.
- It reduces perplexity degradation by 15 to 30 percent at small budgets.
- With only 64 samples it matches or exceeds random calibration that uses 256 samples.
- The gains hold across LLaMA-2, LLaMA-3 and Mistral under both AWQ and GPTQ backends.
- Sample selection requires no GPU time once activation statistics are precomputed (a sketch of this precomputation follows the list).
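As referenced in the last bullet, here is a minimal sketch of how the per-sample coverage sets and channel weights might be precomputed from cached activation statistics. The percentile-based outlier criterion and the 0.5 coverage factor are illustrative assumptions, not the paper's recipe.

```python
import torch

# acts[i]: (tokens x hidden) activation matrix cached for candidate sample i
# at some layer; one forward pass over the candidate pool suffices.
def build_coverage(acts, percentile=99.5):
    # Per-channel peak magnitude across the whole candidate pool.
    peak = torch.stack([a.abs().amax(dim=0) for a in acts]).amax(dim=0)
    thresh = torch.quantile(peak, percentile / 100.0)
    outlier_channels = (peak >= thresh).nonzero(as_tuple=True)[0]
    # Weight each outlier channel by its peak magnitude (assumed weighting).
    weights = {int(c): float(peak[c]) for c in outlier_channels}

    coverage = []
    for a in acts:
        sample_peak = a.abs().amax(dim=0)
        # A sample "covers" an outlier channel if it drives that channel
        # near its pool-wide peak; the 0.5 factor is an illustrative choice.
        covered = {int(c) for c in outlier_channels
                   if sample_peak[c] >= 0.5 * peak[c]}
        coverage.append(covered)
    return coverage, weights
```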
Where Pith is reading between the lines
- Activation statistics can be computed once and reused across different bit-widths or quantization backends without additional inference cost.
- The same coverage principle could be tested on other compression methods such as pruning or low-rank adaptation where certain dimensions dominate error.
- Because the objective is submodular, faster approximation algorithms such as lazy or stochastic greedy become viable for very large candidate pools of calibration data (a lazy-greedy sketch follows this list).
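To make the last point concrete, here is a minimal lazy-greedy (Minoux-style) sketch that exploits submodularity to avoid rescoring every candidate in every round; it is a speculative scaling variant consistent with the bullet above, not something the paper reports.

```python
import heapq

# Lazy greedy for the same weighted coverage objective: submodularity
# guarantees marginal gains only shrink as the covered set grows, so a stale
# upper bound popped from the heap can be refreshed instead of rescoring
# every remaining candidate.
def lazy_greedy_covercal(coverage, weights, budget):
    def gain(i, covered):
        return sum(weights[c] for c in coverage[i] - covered)

    covered, selected = set(), []
    heap = [(-gain(i, covered), i) for i in range(len(coverage))]  # max-heap
    heapq.heapify(heap)

    while heap and len(selected) < budget:
        _stale, i = heapq.heappop(heap)
        fresh = gain(i, covered)
        if fresh <= 0:
            continue  # gain can never become positive again; discard
        # If the refreshed gain still beats the best remaining stale bound,
        # candidate i is safe to take now.
        if not heap or fresh >= -heap[0][0]:
            selected.append(i)
            covered |= coverage[i]
        else:
            heapq.heappush(heap, (-fresh, i))
    return selected
```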
Load-bearing premise
The stylized clipping model sufficiently captures real quantization error dynamics in LLMs, and outlier channels identified from activation statistics are the dominant source of per-layer reconstruction error.
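To make the premise concrete, here is a worked clipping example under a plain per-channel symmetric absmax quantizer; this is an assumption-laden stand-in chosen for illustration, not a reproduction of the paper's stylized model.

```latex
% Symmetric absmax quantization of one channel c, with the scale estimated
% from the calibration set S (q_max = 7 for INT4 in this illustration):
\[
  s_c \;=\; \frac{\max_{x \in S} |x_c|}{q_{\max}}, \qquad
  \hat{x}_c \;=\; s_c \cdot \operatorname{clamp}\!\big(\operatorname{round}(x_c / s_c),\; -q_{\max}-1,\; q_{\max}\big).
\]
% If S only drives channel c up to magnitude 3 while true activations reach 12,
% then s_c = 3/7 \approx 0.43 and every value is clipped to [-3, 3]: an outlier
% x_c = 12 dequantizes to 3 with error |12 - 3| = 9. The clipping error grows
% with the unobserved outlier magnitude, which is what magnitude-proportional
% channel weights w_c are meant to capture.
```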
What would settle it
Running the same PTQ pipeline on LLaMA-3 or Mistral with COVERCAL samples versus random samples and finding no reduction in measured per-channel reconstruction error or downstream task performance would falsify the upper-bound justification and the dominance claim.
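A minimal sketch of how that test could be run, using a plain per-channel symmetric quantizer as a stand-in for a full AWQ or GPTQ backend; the function names, the quantizer, and the usage lines are simplifying assumptions, not the paper's pipeline.

```python
import torch

def absmax_quant(x, calib, n_bits=4):
    # Per-channel symmetric quantization with scales estimated from the
    # calibration activations only (stand-in for a real PTQ backend).
    qmax = 2 ** (n_bits - 1) - 1
    scale = calib.abs().amax(dim=0).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def per_channel_error(x_eval, calib):
    # Mean squared reconstruction error per hidden channel on held-out data.
    return ((x_eval - absmax_quant(x_eval, calib)) ** 2).mean(dim=0)

# Falsification test: if COVERCAL-selected samples do not reduce per-channel
# error on outlier channels relative to random samples, the dominance claim
# fails. (x_eval, calib_cov, calib_rand, outlier_channels are placeholders.)
# err_cov  = per_channel_error(x_eval, calib_cov)
# err_rand = per_channel_error(x_eval, calib_rand)
# print((err_rand - err_cov)[outlier_channels].mean())
```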
Original abstract
Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes COVERCAL, a calibration sample selection algorithm for post-training quantization (PTQ) of LLMs. It observes that poor coverage of outlier channels (dimensions with large activations) leads to dynamic-range underestimation and dominant per-channel reconstruction errors. The method formulates selection as a weighted set-cover problem over these channels, uses the greedy algorithm on pre-computed activation statistics (no GPU at selection time), and provides an internal justification via a stylized clipping model in which missed weighted coverage upper-bounds a surrogate loss. Experiments on LLaMA-2/3 and Mistral with AWQ and GPTQ backends show gains over random, max-perplexity, max-activation-variance, and stratified baselines, especially at small budgets (e.g., 1.2–1.5 MMLU points and 15–30% less perplexity degradation at 128 samples).
Significance. If the stylized-model bound is shown to track real per-layer reconstruction error and the empirical gains prove robust, the work supplies a principled, efficient, and submodular formulation for calibration selection that improves PTQ quality without new quantizers or extra compute during selection. The pre-computed statistics and monotone-submodular guarantee are concrete strengths that could be adopted by existing PTQ pipelines.
major comments (2)
- Stylized clipping model (described after the algorithm): the claim that missed weighted coverage upper-bounds surrogate loss is derived only under the paper's internal clipping assumptions; no comparison is reported between this bound and measured layer-wise MSE or downstream degradation on real activations. Without such validation, the justification that the objective is 'principled rather than purely empirical' rests on untested tightness and may not hold when rounding or weight-quantization interactions dominate.
- Experimental section (results on LLaMA-2/3, Mistral, AWQ/GPTQ): the reported gains (1.2–1.5 MMLU points at 128 samples) are summary statistics only; the manuscript does not show per-run variance, statistical significance tests, or ablation isolating the contribution of the weighted coverage term versus the outlier-channel identification step itself.
minor comments (2)
- Abstract: 'five downstream evaluations' are mentioned but not named; listing the tasks (e.g., MMLU, perplexity on specific datasets) would improve readability.
- Notation: define 'outlier channel' and the precise weighting scheme in the first section where they appear, rather than relying on later algorithmic description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the theoretical validation and experimental rigor without altering the core claims.
Point-by-point responses
- Referee: Stylized clipping model (described after the algorithm): the claim that missed weighted coverage upper-bounds surrogate loss is derived only under the paper's internal clipping assumptions; no comparison is reported between this bound and measured layer-wise MSE or downstream degradation on real activations. Without such validation, the justification that the objective is 'principled rather than purely empirical' rests on untested tightness and may not hold when rounding or weight-quantization interactions dominate.
Authors: We agree that the stylized model derives the bound under clipping assumptions and that direct empirical validation against real per-layer MSE is not present in the current manuscript. This is a fair observation. In the revision we will add a new appendix section with comparisons of the missed weighted coverage bound to measured layer-wise reconstruction MSE on representative layers from LLaMA-2/3 and Mistral under both AWQ and GPTQ. We will also note the potential influence of rounding and weight quantization on tightness. The derivation still supplies an internally consistent motivation for weighting by outlier magnitude, and the downstream gains remain the primary evidence of utility. revision: yes
- Referee: Experimental section (results on LLaMA-2/3, Mistral, AWQ/GPTQ): the reported gains (1.2–1.5 MMLU points at 128 samples) are summary statistics only; the manuscript does not show per-run variance, statistical significance tests, or ablation isolating the contribution of the weighted coverage term versus the outlier-channel identification step itself.
Authors: We acknowledge that per-run variance, significance testing, and ablations isolating individual components are absent from the reported results. In the revised version we will include standard deviations computed over five independent runs of the calibration selection for all key metrics, together with paired t-test p-values for the main improvements versus random calibration (a minimal sketch of such a paired test appears below). We will also add an ablation that replaces the channel weights with uniform weights while keeping the same outlier-channel set, thereby separating the contribution of the weighting from the outlier identification step. revision: yes
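For the committed significance testing, a minimal sketch of the paired comparison over five selection runs; the scores below are placeholders, not results from the paper.

```python
from scipy import stats

# Paired t-test over five independent calibration-selection runs.
# The MMLU values are placeholders, not reported numbers.
mmlu_covercal = [54.9, 55.1, 54.7, 55.0, 54.8]
mmlu_random   = [53.6, 53.4, 53.8, 53.5, 53.3]

t_stat, p_value = stats.ttest_rel(mmlu_covercal, mmlu_random)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```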
Circularity Check
No circularity: derivation uses standard submodular optimization with an internal model-based bound that does not reduce to inputs by construction
Full rationale
The paper identifies outlier channels from activation statistics, formulates sample selection as a weighted set cover problem whose objective is monotone submodular, and applies the standard greedy algorithm (COVERCAL) that runs on pre-computed statistics. The justification that missed weighted coverage upper-bounds surrogate loss is derived under an explicitly stylized clipping model presented as an internal consistency argument rather than an external theorem or fitted parameter. No equations, self-citations, or renamings in the abstract or described chain equate any claimed result to its own inputs by definition; the algorithm and bound remain independent of the target PTQ reconstruction loss they aim to improve.
Axiom & Free-Parameter Ledger
free parameters (1)
- channel weights
axioms (1)
- domain assumption: the stylized clipping model yields an accurate upper bound on the quantization surrogate loss