pith. machine review for the scientific record.

arxiv: 2604.24008 · v1 · submitted 2026-04-27 · 💻 cs.LG

Recognition: unknown

Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords post-training quantization · calibration selection · outlier channels · weighted set cover · large language models · submodular optimization · quantization error

The pith

Treating PTQ calibration selection as weighted set cover over outlier channels improves quantization accuracy with small sample budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that post-training quantization of large language models suffers when calibration samples fail to activate outlier channels with large activations, causing the quantizer to underestimate ranges and inflate reconstruction errors. It argues that quality depends primarily on weighted coverage of these channels rather than generic sample representativeness or perplexity. The authors recast selection as a monotone submodular weighted set cover problem and solve it with a greedy algorithm, COVERCAL, that uses only precomputed activation statistics and no GPU time. Under a stylized clipping model they prove that missed weighted coverage upper-bounds surrogate loss, giving the objective a theoretical grounding. Experiments across LLaMA-2, LLaMA-3 and Mistral with AWQ and GPTQ show consistent gains, largest when calibration budgets are tight.

Core claim

Calibration quality in PTQ is governed more by weighted outlier-channel coverage than by generic sample representativeness. We formulate the selection as a weighted set cover problem over outlier channels, whose objective is monotone submodular. The greedy COVERCAL algorithm solves it using pre-computed activation statistics with no GPU time. Under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, making the objective principled. This yields better MMLU scores and lower perplexity than random, max-perplexity, max-activation-variance or stratified baselines, especially at small budgets.

What carries the argument

Weighted set cover over outlier channels, where each calibration sample covers a weighted subset of outlier channels identified from activation statistics, maximized by the greedy algorithm for the monotone submodular objective.
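A minimal sketch of the greedy selection this describes; the data layout (per-sample channel sets, per-channel weights) and the function name are illustrative, not taken from the paper:

```python
def greedy_cover(samples, weights, budget):
    """Greedy weighted set cover over outlier channels (COVERCAL-style sketch).

    samples: dict mapping sample id -> set of outlier-channel ids it activates
    weights: dict mapping channel id -> importance weight
    budget:  number of calibration samples to select
    Returns (selected sample ids, total channel weight covered).
    """
    covered, selected = set(), []
    for _ in range(budget):
        best_id, best_gain = None, 0.0
        for sid, chans in samples.items():
            if sid in selected:
                continue
            # Marginal gain: weight of channels this sample newly covers.
            gain = sum(weights[c] for c in chans - covered)
            if gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:  # no remaining sample adds coverage
            break
        selected.append(best_id)
        covered |= samples[best_id]
    return selected, sum(weights[c] for c in covered)
```

Because the coverage objective is monotone submodular, each greedy round inherits the classic (1 - 1/e) approximation guarantee (Nemhauser, Wolsey, and Fisher [20]) at the cost of one pass over the candidate pool per selected sample.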

If this is right

  • At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration.
  • It reduces perplexity degradation by 15 to 30 percent at small budgets.
  • With only 64 samples it matches or exceeds random calibration that uses 256 samples.
  • The gains hold across LLaMA-2, LLaMA-3 and Mistral under both AWQ and GPTQ backends.
  • Sample selection requires no GPU time once activation statistics are precomputed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Activation statistics can be computed once and reused across different bit-widths or quantization backends without additional inference cost.
  • The same coverage principle could be tested on other compression methods such as pruning or low-rank adaptation where certain dimensions dominate error.
  • Because the objective is submodular, faster approximation algorithms become viable for very large candidate pools of calibration data.
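On the last bullet: one speedup submodularity already licenses is CELF-style lazy greedy, where stale marginal gains sit on a max-heap and are re-scored only when they surface at the top. This is an editorial sketch, not an algorithm from the paper; the data layout (per-sample channel sets, per-channel weights) is illustrative:

```python
import heapq

def lazy_greedy_cover(samples, weights, budget):
    """CELF-style lazy greedy for weighted coverage (editorial sketch).

    Submodularity means a sample's marginal gain can only shrink as the
    covered set grows, so a stale gain is a valid upper bound: recompute a
    candidate's gain only when it reaches the top of the heap.
    """
    covered, selected = set(), []
    # Heap entries: (-gain, sample_id, round_when_gain_was_computed)
    heap = [(-sum(weights[c] for c in chans), sid, 0)
            for sid, chans in samples.items()]
    heapq.heapify(heap)
    for rnd in range(1, budget + 1):
        while heap:
            neg_gain, sid, stamp = heapq.heappop(heap)
            if stamp == rnd:  # gain is current for this round: accept it
                if -neg_gain <= 0:
                    return selected
                selected.append(sid)
                covered |= samples[sid]
                break
            # Stale bound: recompute against the current covered set.
            gain = sum(weights[c] for c in samples[sid] - covered)
            heapq.heappush(heap, (-gain, sid, rnd))
        else:
            break  # candidate pool exhausted
    return selected
```

For large candidate pools most candidates never get re-scored, which is where the practical savings over plain greedy come from.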

Load-bearing premise

The stylized clipping model sufficiently captures real quantization-error dynamics in LLMs, and the outlier channels identified from activation statistics are the dominant source of per-layer reconstruction error.
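The failure mode behind this premise is easy to illustrate, if not to validate: under a toy symmetric INT4 quantizer (an assumption; the paper's exact clipping model is not reproduced here), a calibration set that never activates a high-magnitude channel picks too small a scale, and clipping error on that channel dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, scale, bits=4):
    """Symmetric uniform quantizer: scale, round, clip to the grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Toy layer: ordinary channels ~N(0,1), one outlier channel ~N(0,10).
normal = rng.normal(0.0, 1.0, 10_000)
outlier = rng.normal(0.0, 10.0, 10_000)

levels = 2 ** 3 - 1  # INT4, symmetric

# Calibration that activated the outlier channel picks a wide scale;
# calibration that missed it calibrates the range on normal channels only.
seen_scale = np.abs(np.concatenate([normal, outlier])).max() / levels
missed_scale = np.abs(normal).max() / levels

err_seen = np.mean((outlier - quantize(outlier, seen_scale)) ** 2)
err_missed = np.mean((outlier - quantize(outlier, missed_scale)) ** 2)
# Range underestimation clips the outlier channel and inflates its MSE.
assert err_missed > 5 * err_seen
```

Whether this toy dominance carries over to real layer-wise loss, where rounding and weight-quantization interactions also matter, is exactly what the referee report below asks the authors to measure.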

What would settle it

Running the same PTQ pipeline on LLaMA-3 or Mistral with COVERCAL samples versus random samples and finding no reduction in measured per-channel reconstruction error or downstream task performance would falsify the upper-bound justification and the dominance claim.

Figures

Figures reproduced from arXiv: 2604.24008 by Anuj Sharma, Ibne Farabi Shihab, Sanjeda Akter.

Figure 1. Budget efficiency on LLaMA-3-8B with AWQ INT4.
Original abstract

Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes COVERCAL, a calibration sample selection algorithm for post-training quantization (PTQ) of LLMs. It observes that poor coverage of outlier channels (dimensions with large activations) leads to dynamic-range underestimation and dominant per-channel reconstruction errors. The method formulates selection as a weighted set-cover problem over these channels, uses the greedy algorithm on pre-computed activation statistics (no GPU at selection time), and provides an internal justification via a stylized clipping model in which missed weighted coverage upper-bounds a surrogate loss. Experiments on LLaMA-2/3 and Mistral with AWQ and GPTQ backends show gains over random, max-perplexity, max-activation-variance, and stratified baselines, especially at small budgets (e.g., 1.2–1.5 MMLU points and 15–30% less perplexity degradation at 128 samples).

Significance. If the stylized-model bound is shown to track real per-layer reconstruction error and the empirical gains prove robust, the work supplies a principled, efficient, and submodular formulation for calibration selection that improves PTQ quality without new quantizers or extra compute during selection. The pre-computed statistics and monotone-submodular guarantee are concrete strengths that could be adopted by existing PTQ pipelines.

major comments (2)
  1. Stylized clipping model (described after the algorithm): the claim that missed weighted coverage upper-bounds surrogate loss is derived only under the paper's internal clipping assumptions; no comparison is reported between this bound and measured layer-wise MSE or downstream degradation on real activations. Without such validation, the justification that the objective is 'principled rather than purely empirical' rests on untested tightness and may not hold when rounding or weight-quantization interactions dominate.
  2. Experimental section (results on LLaMA-2/3, Mistral, AWQ/GPTQ): the reported gains (1.2–1.5 MMLU points at 128 samples) are summary statistics only; the manuscript does not show per-run variance, statistical significance tests, or ablation isolating the contribution of the weighted coverage term versus the outlier-channel identification step itself.
minor comments (2)
  1. Abstract: 'five downstream evaluations' are mentioned but not named; listing the tasks (e.g., MMLU, perplexity on specific datasets) would improve readability.
  2. Notation: define 'outlier channel' and the precise weighting scheme in the first section where they appear, rather than relying on later algorithmic description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the theoretical validation and experimental rigor without altering the core claims.

Point-by-point responses
  1. Referee: Stylized clipping model (described after the algorithm): the claim that missed weighted coverage upper-bounds surrogate loss is derived only under the paper's internal clipping assumptions; no comparison is reported between this bound and measured layer-wise MSE or downstream degradation on real activations. Without such validation, the justification that the objective is 'principled rather than purely empirical' rests on untested tightness and may not hold when rounding or weight-quantization interactions dominate.

    Authors: We agree that the stylized model derives the bound under clipping assumptions and that direct empirical validation against real per-layer MSE is not present in the current manuscript. This is a fair observation. In the revision we will add a new appendix section with comparisons of the missed weighted coverage bound to measured layer-wise reconstruction MSE on representative layers from LLaMA-2/3 and Mistral under both AWQ and GPTQ. We will also note the potential influence of rounding and weight quantization on tightness. The derivation still supplies an internally consistent motivation for weighting by outlier magnitude, and the downstream gains remain the primary evidence of utility. revision: yes

  2. Referee: Experimental section (results on LLaMA-2/3, Mistral, AWQ/GPTQ): the reported gains (1.2–1.5 MMLU points at 128 samples) are summary statistics only; the manuscript does not show per-run variance, statistical significance tests, or ablation isolating the contribution of the weighted coverage term versus the outlier-channel identification step itself.

    Authors: We acknowledge that variance, significance testing, and isolating ablations are absent from the reported results. In the revised version we will include standard deviations computed over five independent runs of the calibration selection for all key metrics, together with paired t-test p-values for the main improvements versus random calibration. We will also add an ablation that replaces the learned weights with uniform weights while keeping the same outlier-channel set, thereby separating the contribution of the weighting from the outlier identification step. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses standard submodular optimization with an internal model-based bound that does not reduce to inputs by construction

Full rationale

The paper identifies outlier channels from activation statistics, formulates sample selection as a weighted set cover problem whose objective is monotone submodular, and applies the standard greedy algorithm (COVERCAL) that runs on pre-computed statistics. The justification that missed weighted coverage upper-bounds surrogate loss is derived under an explicitly stylized clipping model presented as an internal consistency argument rather than an external theorem or fitted parameter. No equations, self-citations, or renamings in the abstract or described chain equate any claimed result to its own inputs by definition; the algorithm and bound remain independent of the target PTQ reconstruction loss they aim to improve.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on pre-computed activation statistics to identify outlier channels and on a stylized clipping model used to bound surrogate loss; no new entities are postulated.

free parameters (1)
  • channel weights
    Weights assigned to outlier channels to reflect their contribution to loss; exact computation rule not detailed in abstract.
axioms (1)
  • domain assumption: the stylized clipping model accurately upper-bounds quantization surrogate loss
    Invoked to justify that missed weighted coverage bounds the error.
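The ledger flags the channel-weight rule as unspecified. One plausible stand-in, offered purely as a hypothesis rather than the paper's rule, flags channels whose peak activation magnitude is an outlier among per-channel peaks and weights each flagged channel by that peak:

```python
import numpy as np

def outlier_channels(acts, k=3.0):
    """Hypothetical outlier-channel weighting (not taken from the paper).

    acts: activation matrix of shape (num_tokens, hidden_dim).
    Flags channels whose peak |activation| exceeds the mean per-channel
    peak by k standard deviations; weights each flagged channel by that
    peak magnitude.
    """
    peaks = np.abs(acts).max(axis=0)          # per-channel dynamic range
    thresh = peaks.mean() + k * peaks.std()
    flagged = np.flatnonzero(peaks > thresh)  # outlier channel ids
    return {int(c): float(peaks[c]) for c in flagged}
```

Any statistic of this kind can be accumulated in a single streaming pass over the candidate pool, which is consistent with the claim that selection itself needs no GPU time once the statistics exist.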

pith-pipeline@v0.9.0 · 5620 in / 1300 out tokens · 91328 ms · 2026-05-08T04:31:39.816605+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    QuaRot: Outlier-free 4-bit inference in rotated LLMs

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In NeurIPS, 2024

  2. [2]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In ICLR, 2024

  3. [3]

    QuIP: 2-bit quantization of large language models with guarantees

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. In NeurIPS, 2023

  4. [4]

    A greedy heuristic for the set-covering problem

    Vašek Chvátal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457, 2018

  6. [6]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022

  7. [7]

    SpQR: A sparse-quantized representation for near-lossless LLM weight compression

    Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv:2306.03078, 2023

  8. [8]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In ICLR, 2023

  9. [9]

    Preserving LLM capabilities through calibration data curation: From analysis to optimization

    Bowei He, Guangxuan Xiao, and Song Han. Preserving LLM capabilities through calibration data curation: From analysis to optimization. arXiv:2510.10618, 2025

  10. [10]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InICLR, 2021

  11. [11]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv:2310.06825, 2023

  12. [12]

    GLISTER: Generalization based data subset selection for efficient and robust learning

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. GLISTER: Generalization based data subset selection for efficient and robust learning. In AAAI, 2021

  13. [13]

    SqueezeLLM: Dense-and-sparse quantization

    Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. arXiv:2306.07629, 2023

  14. [14]

    On the impact of calibration data in post-training quantization and pruning

    Matej Klemen and Marko Robnik-Šikonja. On the impact of calibration data in post-training quantization and pruning. arXiv:2311.09755, 2023

  15. [15]

    AWQ: Activation-aware weight quantization for LLM compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. InMLSys, 2024

  16. [16]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, and Vikas Chandra. SpinQuant: LLM quantization with learned rotations. arXiv:2405.16406, 2024

  17. [17]

    Llama 3 model card, 2024

    Meta AI. Llama 3 model card, 2024

  18. [18]

    Coresets for data-efficient training of machine learning models

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InICML, 2020

  19. [19]

    Frequency matters: Fast model-agnostic data curation for pruning and quantization

    Francesco Pio Monaco, Bhanukiran Vinzamuri, Dhruv Choudhary, and Sashank J. Reddi. Frequency matters: Fast model-agnostic data curation for pruning and quantization. arXiv:2603.16105, 2026

  20. [20]

    An analysis of approximations for maximizing submodular set functions—I

    George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978

  21. [21]

    Outliers and calibration sets have diminishing effect on quantization of modern LLMs

    Andrei Panferov, Alexander Borzunov, and Artem Ushakov. Outliers and calibration sets have diminishing effect on quantization of modern LLMs. arXiv:2405.20835, 2024

  22. [22]

    WinoGrande: An adversarial Winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. InAAAI, 2020

  23. [23]

    Active learning for convolutional neural networks: A core-set approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. InICLR, 2018

  24. [24]

    Active learning literature survey

    Burr Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2009

  25. [25]

    Beyond variance: Knowledge-aware LLM compression via Fisher-aligned subspace diagnostics

    Ibne Farabi Shihab, Sanjeda Akter, and Anuj Sharma. Beyond variance: Knowledge-aware LLM compression via Fisher-aligned subspace diagnostics. arXiv:2601.07197, 2026

  26. [26]

    RedPajama: An open dataset for training large language models

    Together AI. RedPajama: An open dataset for training large language models, 2023

  27. [27]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023

  28. [28]

    Self-calibration for language model quantization and pruning

    Miles Williams, George Chrysostomou, and Nikolaos Aletras. Self-calibration for language model quantization and pruning. In NAACL, 2025. arXiv:2410.17170

  29. [29]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023

  30. [30]

    Data selection for language models via importance resampling

    Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. In NeurIPS, 2023

  31. [31]

    TurboQuant: Online vector quantization with near-optimal distortion rate

    Amir Zandieh, Majid Daliri, Amin Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. In ICLR, 2026

  32. [32]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In ACL, 2019