pith. machine review for the scientific record.

arxiv: 2604.24008 · v1 · submitted 2026-04-27 · 💻 cs.LG

Recognition: unknown

Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords post-training quantization · calibration selection · outlier channels · weighted set cover · large language models · submodular optimization · quantization error

The pith

Treating PTQ calibration selection as weighted set cover over outlier channels improves quantization accuracy with small sample budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that post-training quantization of large language models suffers when calibration samples fail to activate outlier channels with large activations, causing the quantizer to underestimate ranges and inflate reconstruction errors. It argues that quality depends primarily on weighted coverage of these channels rather than generic sample representativeness or perplexity. The authors recast selection as a monotone submodular weighted set cover problem and solve it with a greedy algorithm, COVERCAL, that uses only precomputed activation statistics and no GPU time. Under a stylized clipping model they prove that missed weighted coverage upper-bounds surrogate loss, giving the objective a theoretical grounding. Experiments across LLaMA-2, LLaMA-3 and Mistral with AWQ and GPTQ show consistent gains, largest when calibration budgets are tight.

Core claim

Calibration quality in PTQ is governed more by weighted outlier-channel coverage than by generic sample representativeness. We formulate the selection as a weighted set cover problem over outlier channels, whose objective is monotone submodular. The greedy COVERCAL algorithm solves it using pre-computed activation statistics with no GPU time. Under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, making the objective principled. This yields better MMLU scores and lower perplexity than random, max-perplexity, max-activation-variance or stratified baselines, especially at small budgets.

What carries the argument

Weighted set cover over outlier channels, where each calibration sample covers a weighted subset of outlier channels identified from activation statistics, maximized by the greedy algorithm for the monotone submodular objective.
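A minimal sketch of the greedy selection this describes; the data layout (per-sample channel sets, per-channel weights) and the function name are illustrative, not taken from the paper:

```python
def greedy_cover(samples, weights, budget):
    """Greedy weighted set cover over outlier channels (COVERCAL-style sketch).

    samples: dict mapping sample id -> set of outlier-channel ids it activates
    weights: dict mapping channel id -> importance weight
    budget:  number of calibration samples to select
    Returns (selected sample ids, total channel weight covered).
    """
    covered, selected = set(), []
    for _ in range(budget):
        best_id, best_gain = None, 0.0
        for sid, chans in samples.items():
            if sid in selected:
                continue
            # Marginal gain: weight of channels this sample newly covers.
            gain = sum(weights[c] for c in chans - covered)
            if gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:  # no remaining sample adds coverage
            break
        selected.append(best_id)
        covered |= samples[best_id]
    return selected, sum(weights[c] for c in covered)
```

Because the coverage objective is monotone submodular, each greedy round inherits the classic (1 - 1/e) approximation guarantee (Nemhauser, Wolsey, and Fisher [20]) at the cost of one pass over the candidate pool per selected sample.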

If this is right

  • At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration.
  • It reduces perplexity degradation by 15 to 30 percent at small budgets.
  • With only 64 samples it matches or exceeds random calibration that uses 256 samples.
  • The gains hold across LLaMA-2, LLaMA-3 and Mistral under both AWQ and GPTQ backends.
  • Sample selection requires no GPU time once activation statistics are precomputed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Activation statistics can be computed once and reused across different bit-widths or quantization backends without additional inference cost.
  • The same coverage principle could be tested on other compression methods such as pruning or low-rank adaptation where certain dimensions dominate error.
  • Because the objective is submodular, faster approximation algorithms become viable for very large candidate pools of calibration data.
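On the last bullet: one speedup submodularity already licenses is CELF-style lazy greedy, where stale marginal gains sit on a max-heap and are re-scored only when they surface at the top. This is an editorial sketch, not an algorithm from the paper; the data layout (per-sample channel sets, per-channel weights) is illustrative:

```python
import heapq

def lazy_greedy_cover(samples, weights, budget):
    """CELF-style lazy greedy for weighted coverage (editorial sketch).

    Submodularity means a sample's marginal gain can only shrink as the
    covered set grows, so a stale gain is a valid upper bound: recompute a
    candidate's gain only when it reaches the top of the heap.
    """
    covered, selected = set(), []
    # Heap entries: (-gain, sample_id, round_when_gain_was_computed)
    heap = [(-sum(weights[c] for c in chans), sid, 0)
            for sid, chans in samples.items()]
    heapq.heapify(heap)
    for rnd in range(1, budget + 1):
        while heap:
            neg_gain, sid, stamp = heapq.heappop(heap)
            if stamp == rnd:  # gain is current for this round: accept it
                if -neg_gain <= 0:
                    return selected
                selected.append(sid)
                covered |= samples[sid]
                break
            # Stale bound: recompute against the current covered set.
            gain = sum(weights[c] for c in samples[sid] - covered)
            heapq.heappush(heap, (-gain, sid, rnd))
        else:
            break  # candidate pool exhausted
    return selected
```

For large candidate pools most candidates never get re-scored, which is where the practical savings over plain greedy come from.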

Load-bearing premise

The stylized clipping model sufficiently captures real quantization-error dynamics in LLMs, and the outlier channels identified from activation statistics are the dominant source of per-layer reconstruction error.
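The failure mode behind this premise is easy to illustrate, if not to validate: under a toy symmetric INT4 quantizer (an assumption; the paper's exact clipping model is not reproduced here), a calibration set that never activates a high-magnitude channel picks too small a scale, and clipping error on that channel dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, scale, bits=4):
    """Symmetric uniform quantizer: scale, round, clip to the grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Toy layer: ordinary channels ~N(0,1), one outlier channel ~N(0,10).
normal = rng.normal(0.0, 1.0, 10_000)
outlier = rng.normal(0.0, 10.0, 10_000)

levels = 2 ** 3 - 1  # INT4, symmetric

# Calibration that activated the outlier channel picks a wide scale;
# calibration that missed it calibrates the range on normal channels only.
seen_scale = np.abs(np.concatenate([normal, outlier])).max() / levels
missed_scale = np.abs(normal).max() / levels

err_seen = np.mean((outlier - quantize(outlier, seen_scale)) ** 2)
err_missed = np.mean((outlier - quantize(outlier, missed_scale)) ** 2)
# Range underestimation clips the outlier channel and inflates its MSE.
assert err_missed > 5 * err_seen
```

Whether this toy dominance carries over to real layer-wise loss, where rounding and weight-quantization interactions also matter, is exactly what the referee report below asks the authors to measure.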

What would settle it

Running the same PTQ pipeline on LLaMA-3 or Mistral with COVERCAL samples versus random samples and finding no reduction in measured per-channel reconstruction error or downstream task performance would falsify the upper-bound justification and the dominance claim.

Figures

Figures reproduced from arXiv: 2604.24008 by Anuj Sharma, Ibne Farabi Shihab, Sanjeda Akter.

Figure 1. Budget efficiency on LLaMA-3-8B with AWQ INT4.
Original abstract

Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes COVERCAL, a calibration sample selection algorithm for post-training quantization (PTQ) of LLMs. It observes that poor coverage of outlier channels (dimensions with large activations) leads to dynamic-range underestimation and dominant per-channel reconstruction errors. The method formulates selection as a weighted set-cover problem over these channels, uses the greedy algorithm on pre-computed activation statistics (no GPU at selection time), and provides an internal justification via a stylized clipping model in which missed weighted coverage upper-bounds a surrogate loss. Experiments on LLaMA-2/3 and Mistral with AWQ and GPTQ backends show gains over random, max-perplexity, max-activation-variance, and stratified baselines, especially at small budgets (e.g., 1.2–1.5 MMLU points and 15–30% less perplexity degradation at 128 samples).

Significance. If the stylized-model bound is shown to track real per-layer reconstruction error and the empirical gains prove robust, the work supplies a principled, efficient, and submodular formulation for calibration selection that improves PTQ quality without new quantizers or extra compute during selection. The pre-computed statistics and monotone-submodular guarantee are concrete strengths that could be adopted by existing PTQ pipelines.

major comments (2)
  1. Stylized clipping model (described after the algorithm): the claim that missed weighted coverage upper-bounds surrogate loss is derived only under the paper's internal clipping assumptions; no comparison is reported between this bound and measured layer-wise MSE or downstream degradation on real activations. Without such validation, the justification that the objective is 'principled rather than purely empirical' rests on untested tightness and may not hold when rounding or weight-quantization interactions dominate.
  2. Experimental section (results on LLaMA-2/3, Mistral, AWQ/GPTQ): the reported gains (1.2–1.5 MMLU points at 128 samples) are summary statistics only; the manuscript does not show per-run variance, statistical significance tests, or ablation isolating the contribution of the weighted coverage term versus the outlier-channel identification step itself.
minor comments (2)
  1. Abstract: 'five downstream evaluations' are mentioned but not named; listing the tasks (e.g., MMLU, perplexity on specific datasets) would improve readability.
  2. Notation: define 'outlier channel' and the precise weighting scheme in the first section where they appear, rather than relying on later algorithmic description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the theoretical validation and experimental rigor without altering the core claims.

Point-by-point responses
  1. Referee: Stylized clipping model (described after the algorithm): the claim that missed weighted coverage upper-bounds surrogate loss is derived only under the paper's internal clipping assumptions; no comparison is reported between this bound and measured layer-wise MSE or downstream degradation on real activations. Without such validation, the justification that the objective is 'principled rather than purely empirical' rests on untested tightness and may not hold when rounding or weight-quantization interactions dominate.

    Authors: We agree that the stylized model derives the bound under clipping assumptions and that direct empirical validation against real per-layer MSE is not present in the current manuscript. This is a fair observation. In the revision we will add a new appendix section with comparisons of the missed weighted coverage bound to measured layer-wise reconstruction MSE on representative layers from LLaMA-2/3 and Mistral under both AWQ and GPTQ. We will also note the potential influence of rounding and weight quantization on tightness. The derivation still supplies an internally consistent motivation for weighting by outlier magnitude, and the downstream gains remain the primary evidence of utility. revision: yes

  2. Referee: Experimental section (results on LLaMA-2/3, Mistral, AWQ/GPTQ): the reported gains (1.2–1.5 MMLU points at 128 samples) are summary statistics only; the manuscript does not show per-run variance, statistical significance tests, or ablation isolating the contribution of the weighted coverage term versus the outlier-channel identification step itself.

    Authors: We acknowledge that variance, significance testing, and isolating ablations are absent from the reported results. In the revised version we will include standard deviations computed over five independent runs of the calibration selection for all key metrics, together with paired t-test p-values for the main improvements versus random calibration. We will also add an ablation that replaces the learned weights with uniform weights while keeping the same outlier-channel set, thereby separating the contribution of the weighting from the outlier identification step. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses standard submodular optimization with an internal model-based bound that does not reduce to inputs by construction

Full rationale

The paper identifies outlier channels from activation statistics, formulates sample selection as a weighted set cover problem whose objective is monotone submodular, and applies the standard greedy algorithm (COVERCAL) that runs on pre-computed statistics. The justification that missed weighted coverage upper-bounds surrogate loss is derived under an explicitly stylized clipping model presented as an internal consistency argument rather than an external theorem or fitted parameter. No equations, self-citations, or renamings in the abstract or described chain equate any claimed result to its own inputs by definition; the algorithm and bound remain independent of the target PTQ reconstruction loss they aim to improve.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on pre-computed activation statistics to identify outlier channels and on a stylized clipping model used to bound surrogate loss; no new entities are postulated.

free parameters (1)
  • channel weights
    Weights assigned to outlier channels to reflect their contribution to loss; exact computation rule not detailed in abstract.
axioms (1)
  • domain assumption: the stylized clipping model accurately upper-bounds quantization surrogate loss
    Invoked to justify that missed weighted coverage bounds the error.
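The ledger flags the channel-weight rule as unspecified. One plausible stand-in, offered purely as a hypothesis rather than the paper's rule, flags channels whose peak activation magnitude is an outlier among per-channel peaks and weights each flagged channel by that peak:

```python
import numpy as np

def outlier_channels(acts, k=3.0):
    """Hypothetical outlier-channel weighting (not taken from the paper).

    acts: activation matrix of shape (num_tokens, hidden_dim).
    Flags channels whose peak |activation| exceeds the mean per-channel
    peak by k standard deviations; weights each flagged channel by that
    peak magnitude.
    """
    peaks = np.abs(acts).max(axis=0)          # per-channel dynamic range
    thresh = peaks.mean() + k * peaks.std()
    flagged = np.flatnonzero(peaks > thresh)  # outlier channel ids
    return {int(c): float(peaks[c]) for c in flagged}
```

Any statistic of this kind can be accumulated in a single streaming pass over the candidate pool, which is consistent with the claim that selection itself needs no GPU time once the statistics exist.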

pith-pipeline@v0.9.0 · 5620 in / 1300 out tokens · 91328 ms · 2026-05-08T04:31:39.816605+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    QuaRot: Outlier-free 4-bit inference in rotated LLMs

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In NeurIPS, 2024

  2. [2]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In ICLR, 2024

  3. [3]

    QuIP: 2-bit quantization of large language models with guarantees

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. In NeurIPS, 2023

  4. [4]

    A greedy heuristic for the set-covering problem

    Vašek Chvátal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457, 2018

  6. [6]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022

  7. [7]

    SpQR: A sparse-quantized representation for near-lossless LLM weight compression

    Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv:2306.03078, 2023

  8. [8]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In ICLR, 2023

  9. [9]

    Preserving LLM capabilities through calibration data curation: From analysis to optimization

    Bowei He, Guangxuan Xiao, and Song Han. Preserving LLM capabilities through calibration data curation: From analysis to optimization. arXiv:2510.10618, 2025

  10. [10]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InICLR, 2021

  11. [11]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv:2310.06825, 2023

  12. [12]

    GLISTER: Generalization based data subset selection for efficient and robust learning

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. GLISTER: Generalization based data subset selection for efficient and robust learning. In AAAI, 2021

  13. [13]

    SqueezeLLM: Dense-and-sparse quantization

    Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. arXiv:2306.07629, 2023

  14. [14]

    On the impact of calibration data in post-training quantization and pruning

    Matej Klemen and Marko Robnik-Šikonja. On the impact of calibration data in post-training quantization and pruning. arXiv:2311.09755, 2023

  15. [15]

    AWQ: Activation-aware weight quantization for LLM compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. InMLSys, 2024

  16. [16]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, and Vikas Chandra. SpinQuant: LLM quantization with learned rotations. arXiv:2405.16406, 2024

  17. [17]

    Llama 3 model card, 2024

    Meta AI. Llama 3 model card, 2024

  18. [18]

    Coresets for data-efficient training of machine learning models

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InICML, 2020

  19. [19]

    Frequency matters: Fast model-agnostic data curation for pruning and quantization

    Francesco Pio Monaco, Bhanukiran Vinzamuri, Dhruv Choudhary, and Sashank J. Reddi. Frequency matters: Fast model-agnostic data curation for pruning and quantization. arXiv:2603.16105, 2026

  20. [20]

    An analysis of approximations for maximizing submodular set functions—I

    George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978

  21. [21]

    Outliers and calibration sets have diminishing effect on quantization of modern LLMs

    Andrei Panferov, Alexander Borzunov, and Artem Ushakov. Outliers and calibration sets have diminishing effect on quantization of modern LLMs. arXiv:2405.20835, 2024

  22. [22]

    WinoGrande: An adversarial Winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. InAAAI, 2020

  23. [23]

    Active learning for convolutional neural networks: A core-set approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. InICLR, 2018

  24. [24]

    Active learning literature survey

    Burr Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2009

  25. [25]

    Beyond variance: Knowledge-aware LLM compression via Fisher-aligned subspace diagnostics

    Ibne Farabi Shihab, Sanjeda Akter, and Anuj Sharma. Beyond variance: Knowledge-aware LLM compression via Fisher-aligned subspace diagnostics. arXiv:2601.07197, 2026

  26. [26]

    RedPajama: An open dataset for training large language models

    Together AI. RedPajama: An open dataset for training large language models, 2023

  27. [27]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023

  28. [28]

    Self-calibration for language model quantization and pruning

    Miles Williams, George Chrysostomou, and Nikolaos Aletras. Self-calibration for language model quantization and pruning. In NAACL, 2025. arXiv:2410.17170

  29. [29]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023

  30. [30]

    Data selection for language models via importance resampling

    Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. In NeurIPS, 2023

  31. [31]

    TurboQuant: Online vector quantization with near-optimal distortion rate

    Amir Zandieh, Majid Daliri, Amin Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. In ICLR, 2026

  32. [32]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In ACL, 2019