Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization

Changxing Ding; Chao Xue; Fei Wang; Li Shen; Taoran Liu; Ye Liu

arxiv: 2607.00908 · v1 · pith:7N2ZD2GS · submitted 2026-07-01 · 💻 cs.LG

Pith reviewed 2026-07-02 15:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords: mixed-precision quantization · task-aware quantization · LLM compression · calibration data composition · sensitivity analysis · perplexity illusion · gradient-trace alignment · bit allocation

The pith

Appropriately allocated 3.5-bit LLMs match or surpass 4-bit baselines by balancing task-specific and general calibration data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first shows that perplexity-based rankings of layer importance bear almost no relation to the layers that actually drive complex reasoning performance. It then demonstrates an alignment-diversity tradeoff: calibration data drawn only from the target task can degrade final accuracy, while adding general-domain data stabilizes the sensitivity estimates. TASA searches for the best mixture with a training-free gradient-trace criterion and then combines perplexity and reasoning signals to decide both inter-layer and intra-layer bit widths. The resulting mixed-precision models, at an average precision of 3.5 bits, reach or exceed several 4-bit uniform baselines and improve on the strongest W3 baseline by more than 20 absolute points on GSM8K for LLaMA-3-8B.
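One way to see why an intermediate mixture can win is the trace quantity used in the paper's own analysis, tr(H_test H_α⁻¹) with H_α = α·H_gen + (1−α)·H_task, where α weights the general-domain data. The sketch below is a simplification under stated assumptions: all three Hessians are taken to be diagonal in the same basis, and the spectra `H_GEN`, `H_TASK`, `H_TEST` and the helper `trace_mismatch` are hypothetical illustrations, not values or code from the paper.

```python
from typing import Sequence


def trace_mismatch(alpha: float,
                   h_gen: Sequence[float],
                   h_task: Sequence[float],
                   h_test: Sequence[float]) -> float:
    """tr(H_test @ inv(H_alpha)) for diagonal Hessians (assumed shared eigenbasis).

    H_alpha = alpha * H_gen + (1 - alpha) * H_task, with alpha the general-data weight.
    """
    total = 0.0
    for g, k, t in zip(h_gen, h_task, h_test):
        mixed = alpha * g + (1.0 - alpha) * k
        if mixed <= 0:
            raise ValueError("mixed Hessian must be positive definite")
        total += t / mixed
    return total


# Hypothetical two-direction spectra: the task data nearly ignores direction 2,
# so pure task calibration (alpha = 0) leaves that direction badly conditioned.
H_GEN = [1.0, 1.0]
H_TASK = [10.0, 0.01]
H_TEST = [10.0, 0.5]

scores = {a: trace_mismatch(a, H_GEN, H_TASK, H_TEST)
          for a in (0.0, 0.25, 0.5, 0.75, 1.0)}
best_alpha = min(scores, key=scores.get)

for a, s in scores.items():
    print(f"alpha={a:.2f}  trace={s:.3f}")
print("lowest trace at alpha =", best_alpha)
```

With these made-up spectra both endpoints score worse than the balanced mix, which mirrors the qualitative shape of the tradeoff; real Hessians do not share an eigenbasis, so this is only an intuition aid.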

Core claim

TASA jointly optimizes calibration-data composition via a training-free gradient-trace alignment criterion and mixed-precision bit allocation by aggregating perplexity and reasoning-oriented sensitivity signals. This produces a precision inversion in which models averaging 3.5 bits match or exceed less task-aware 4-bit baselines, including a gain of more than 20 absolute points on GSM8K for LLaMA-3-8B over the strongest W3 baseline.

What carries the argument

TASA, a two-level framework that first selects an optimal calibration-data mixture and then guides inter-layer and intra-layer bit allocation using combined sensitivity signals.
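To make the allocation level concrete, here is a minimal sketch of inter-layer bit allocation under an average-bit budget. It is a stand-in under stated assumptions rather than the paper's implementation: the paper describes an Integer Linear Programming step, whereas this sketch solves the same kind of budgeted choice with a small dynamic program. The function `allocate_bits`, the cost table, and the candidate widths are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def allocate_bits(cost: Sequence[Dict[int, float]], budget: int) -> List[int]:
    """Choose one bit-width per layer, minimizing total cost with sum(bits) <= budget.

    cost[l][b] is a hypothetical sensitivity cost of quantizing layer l to b bits.
    """
    # best[t] = (lowest cost, chosen widths) over assignments using exactly t bits
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for layer_cost in cost:
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (acc, picks) in best.items():
            for bits, c in layer_cost.items():
                total = used + bits
                if total > budget:
                    continue  # this choice would exceed the global budget
                cand = acc + c
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, picks + [bits])
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    return min(best.values(), key=lambda v: v[0])[1]


# Four layers, candidate widths {3, 4}; a 3.5-bit average means 14 bits in total.
costs = [
    {3: 0.9, 4: 0.1},   # sensitive layer: large penalty at 3 bits
    {3: 0.2, 4: 0.1},
    {3: 0.8, 4: 0.2},   # sensitive layer
    {3: 0.3, 4: 0.25},
]
bits = allocate_bits(costs, budget=14)
print(bits, sum(bits) / len(bits))
```

In this toy setting the two layers with the largest 3-bit penalty receive 4 bits and the rest stay at 3, which is the kind of heterogeneous pattern a sensitivity-guided allocator is meant to produce.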

Load-bearing premise

The training-free gradient-trace alignment criterion accurately identifies calibration mixtures that improve downstream task performance without overfitting to the search metric.

What would settle it

If 3.5-bit TASA-quantized models consistently underperform 4-bit uniform baselines on multiple reasoning benchmarks across different model families, the claimed precision inversion would be disproved.

Figures

Figures reproduced from arXiv: 2607.00908 by Changxing Ding, Chao Xue, Fei Wang, Li Shen, Taoran Liu, Ye Liu.

**Figure 1.** The Alignment-Diversity Tradeoff. Red curve (left axis): cosine similarity between calibration and target-task traces. Blue dashed curve (right axis): post-quantization aggregate accuracy (Avg.). Optimal accuracy occurs at an intermediate mixing ratio, not at maximum alignment.

**Figure 2.** Overview of the TASA framework. TASA jointly optimizes data composition and bit allocation through four steps: (1) finding the optimal calibration mix via training-free gradient trace alignment; (2) computing a Multi-Objective Aggregation (MOA) sensitivity matrix across task-specific metrics; (3) determining inter-layer average bit-widths via Integer Linear Programming (ILP) under a global budget; and (4)…

**Figure 3.** Aggregate accuracy vs. effective bit-width. The green scaling curve illustrates TASA’s smooth performance progression from 3.0 to 4.0 bits. Gray markers denote uniform-precision baselines at their effective bit-widths, and the dashed line represents the FP16 upper bound. Sec. 5.1 (Main Results) notes that Tabs. 6 and 7 in Sec. B.2 present the full per-task results.

**Figure 4.** Per-layer sensitivity profiles and MOA bit allocation. (a)–(b) Layer-wise sensitivity under W3 quantization for three objectives (symmetric log scale; final layer clipped for visibility). (c)–(d) Bit-width allocation produced by MOA at b3.5

**Figure 5.** Per-task accuracy shifts under extreme calibration mix ratios. Bars show absolute accuracy changes (percentage points) relative to the balanced mix (αwiki = 0.50). The direction and magnitude of the shifts are model-dependent, showing the need for the calibration selection

**Figure 6.** Reasoning accuracy (GSM8K) vs. effective bit-width. The TASA scaling curve (green) shows smooth progression from 3.0 to 4.0 bits. Muted markers represent uniform baselines at their effective bit-widths (incorporating SpQR’s metadata overhead). TASA outperforms all 3-bit baselines and approaches FP16 reasoning accuracy at 3.5-bit

**Figure 7.** Per-layer activation divergence heatmap. Cosine distance (1−cos) between calibration and target-task activations across layers (x-axis) under varying WikiText mix ratios (y-axis). Divergence concentrates at shallow (L0–1) and deep (L22–31) layers, whereas middle layers achieve near-perfect alignment across all configurations. The optimal ratio (marked) is highly task- and model-dependent

**Figure 8.** Hessian eigenvalue decay across calibration compositions (LLaMA-3-8B). Each panel plots the top-100 normalized eigenvalues of the layer-wise Hessian ($H_l = X_l^\top X_l$). ER denotes the effective rank. At the reasoning-critical Layer 22, pure math data yields a sharply degenerate spectrum (ER=3) compared to pure Wiki (ER=6), confirming that task-specific data concentrates Hessian energy into fewer directions. Mi…
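The effective rank (ER) quoted in the Figure 8 caption can be illustrated with a short sketch. The text available here does not say which definition of ER the paper uses, so this assumes one common choice: the exponential of the Shannon entropy of the normalized eigenvalues. The helper `effective_rank` and both example spectra are hypothetical.

```python
import math
from typing import Sequence


def effective_rank(eigenvalues: Sequence[float]) -> float:
    """exp(Shannon entropy) of the normalized spectrum (an assumed definition of ER)."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical spectra: energy spread evenly vs. concentrated in one direction.
flat = effective_rank([1.0] * 6)
peaked = effective_rank([100.0] + [1.0] * 5)
print(round(flat, 3), round(peaked, 3))
```

Under this definition a perfectly flat spectrum over six directions has ER of 6, and concentrating energy in one direction pushes ER toward 1, which is the sense in which a lower ER signals a more degenerate Hessian.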

Original abstract

Mixed-precision quantization (MPQ) has become a key technique for deploying large language models under stringent memory and compute constraints. We first identify a phenomenon that we term the Perplexity Illusion: layers ranked as important by perplexity-based sensitivity show little rank correlation with those that are most influential for complex reasoning performance, with Kendall $\tau \approx 0$ in our analysis. We further reveal an Alignment-Diversity Tradeoff: using only target-task calibration data can degrade post-quantization performance, whereas incorporating general-domain data stabilizes sensitivity estimation and improves robustness across tasks. Based on these observations, we propose TASA (Task-Aware Sensitivity Analysis), a two-level framework that jointly optimizes calibration-data composition and mixed-precision bit allocation. Specifically, TASA searches for a calibration-data mixture using a training-free gradient-trace alignment criterion, and then aggregates perplexity and reasoning-oriented sensitivity signals to guide both inter-layer and intra-layer bit allocation. Experiments on LLaMA-3-8B and Qwen2.5-7B reveal a precision inversion: appropriately allocated 3.5-bit models can match or surpass less task-aware 4-bit baselines. At an average precision of 3.5 bits, TASA matches or outperforms several competitive 4-bit uniform baselines in aggregate accuracy, and improves over the strongest W3 baseline on GSM8K by more than 20 absolute points on LLaMA-3-8B. These results show that calibration-data composition substantially affects task-sensitive quantization, a factor underexplored in prior work.
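The Perplexity Illusion is stated as a rank correlation, so a small sketch of Kendall's τ may help readers unfamiliar with it. This is the plain tau-a variant written from its textbook definition; the abstract does not say which variant the paper computes, and the two sensitivity vectors below are hypothetical numbers chosen to give τ = 0.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-layer sensitivities for four layers under two objectives.
ppl_sensitivity = [0.1, 0.2, 0.3, 0.4]
math_sensitivity = [0.2, 0.4, 0.1, 0.3]
print(kendall_tau(ppl_sensitivity, math_sensitivity))
```

A value near 0 means that knowing which layer ranks higher under perplexity tells you essentially nothing about which ranks higher for the reasoning objective; values of 1 and −1 would mean identical and reversed rankings.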

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TASA reports a real mismatch between perplexity sensitivity and reasoning performance plus a calibration mixture effect, but the gradient-trace search needs ablations to confirm the gains are not artifacts.

The paper's main new points are the Perplexity Illusion (Kendall tau near zero between perplexity-ranked layers and reasoning-important ones) and the Alignment-Diversity Tradeoff (target-only calibration data hurts while a mix with general data helps). TASA then uses a training-free gradient-trace score to pick the mixture before combining perplexity and reasoning sensitivities for inter- and intra-layer bit allocation.

These observations are fresh in the MPQ literature and the reported numbers are concrete: 3.5-bit TASA matches or beats several 4-bit uniform baselines on LLaMA-3-8B and Qwen2.5-7B, with a >20-point GSM8K lift over the best W3 baseline.

The soft spot is the load-bearing premise flagged above. The headline gains depend on the gradient-trace criterion actually surfacing mixtures that improve true task accuracy rather than just the internal trace metric. The abstract gives no protocol details, no ablation on whether the selected mixtures generalize beyond the search objective, and no statistical checks. That leaves the central empirical claim under-supported until those controls appear.

This is for people working on memory-constrained LLM deployment and mixed-precision methods. The ideas are testable and the phenomena are worth checking, so it deserves a serious referee even with the current gaps in evidence.
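The layer mismatch discussed in the note can also be checked set-wise. The sketch below uses the top-8 PPL-sensitive and math-sensitive layer sets that the paper reports for LLaMA-3-8B in its cross-task overlap analysis (Sec. B.5); the helper `top_k_overlap` and the choice of Jaccard similarity as the summary statistic are illustrative assumptions.

```python
from typing import Set, Tuple


def top_k_overlap(a: Set[int], b: Set[int]) -> Tuple[Set[int], float]:
    """Return the shared layers and the Jaccard similarity |a & b| / |a | b|."""
    shared = a & b
    return shared, len(shared) / len(a | b)


# Top-8 sensitive layers on LLaMA-3-8B, as listed in the paper's Sec. B.5.
ppl_top8 = {1, 3, 6, 16, 18, 20, 21, 31}
math_top8 = {2, 10, 11, 12, 13, 14, 20, 31}

shared, jaccard = top_k_overlap(ppl_top8, math_top8)
print(sorted(shared), round(jaccard, 3))  # [20, 31] 0.143
```

Only two of the eight layers coincide, which is consistent with the near-zero rank correlation the paper reports.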

Referee Report

2 major / 0 minor

Summary. The paper identifies a 'Perplexity Illusion' (Kendall τ ≈ 0 between perplexity sensitivity rankings and reasoning-task importance) and an 'Alignment-Diversity Tradeoff' (pure target-task calibration degrades post-quantization performance while mixing general-domain data stabilizes it). It proposes TASA, a two-level training-free framework that first searches calibration-data mixtures via a gradient-trace alignment criterion and then aggregates perplexity and reasoning sensitivities for inter- and intra-layer bit allocation. On LLaMA-3-8B and Qwen2.5-7B the method is claimed to produce 3.5-bit models that match or exceed several 4-bit uniform baselines in aggregate accuracy and improve over the strongest W3 baseline by >20 points on GSM8K.

Significance. If the reported gains are reproducible and not artifacts of the search procedure, the work would usefully highlight that calibration-data composition is a first-class, underexplored factor in task-aware MPQ and that modest average-bit reductions can still preserve complex-reasoning performance when allocation is informed by both alignment and diversity signals.

major comments (2)

[Abstract] The central empirical claim (3.5-bit TASA matching or outperforming 4-bit baselines, and a gain of more than 20 points on GSM8K for LLaMA-3-8B) rests on a two-level procedure whose first stage selects the calibration mixture by a training-free gradient-trace alignment score. The manuscript supplies neither the exact search protocol, nor the candidate mixture pool, nor any ablation demonstrating that the selected mixtures improve downstream task accuracy rather than merely optimizing the internal trace metric.
[Abstract] No experimental protocol, baseline definitions, statistical tests, or ablation results are provided to support the stated improvements or the claim that the discovered mixture generalizes beyond the search objective itself, rendering the support for the Alignment-Diversity Tradeoff unverifiable from the given text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for greater methodological transparency. We will revise the manuscript to address the concerns by expanding key descriptions while preserving conciseness.

Referee: [Abstract] The central empirical claim (3.5-bit TASA matching/outperforming 4-bit baselines and +20 pt GSM8K gain on LLaMA-3-8B) rests on a two-level procedure whose first stage selects the calibration mixture by a training-free gradient-trace alignment score; the manuscript supplies neither the exact search protocol, the candidate mixture pool, nor any ablation demonstrating that the selected mixtures improve downstream task accuracy rather than merely optimizing the internal trace metric.

Authors: We agree the abstract is high-level and does not supply these specifics. In revision we will add a concise description of the search protocol (gradient-trace alignment computed as layer-wise cosine similarity between calibration and task gradients, optimized via a training-free grid search), the candidate mixture pool (combinations of target-task subsets and general-domain corpora), and a brief reference to ablations showing downstream accuracy gains from alignment-selected mixtures over target-only or random baselines. Full protocol and ablation tables will be expanded in the main text. Revision: yes.

Referee: [Abstract] No experimental protocol, baseline definitions, statistical tests, or ablation results are provided to support the stated improvements or the claim that the discovered mixture generalizes beyond the search objective itself, rendering the support for the Alignment-Diversity Tradeoff unverifiable from the given text.

Authors: We acknowledge that the abstract provides insufficient detail on these elements, making verification difficult from the abstract alone. The revision will incorporate brief mentions of the experimental protocol (models, datasets, evaluation metrics), baseline definitions (uniform 4-bit and W3 methods from prior work), robustness via multiple random seeds, and key ablation outcomes supporting the Alignment-Diversity Tradeoff and cross-task generalization. We will also add explicit section references so readers can locate the supporting evidence. Revision: yes.
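The search protocol named in the first response (layer-wise cosine similarity between calibration and task gradients, evaluated over a grid of mixing ratios) can be sketched as follows. Everything concrete here is an assumption for illustration: the mixture gradient is modeled as a convex combination of a general-domain and a task gradient, the per-layer scores are averaged, and the names `cosine` and `alignment_by_ratio` are hypothetical. The sketch only computes the alignment score per ratio; how TASA turns that score into a final choice is not specified in this text, and Figure 1 indicates that the best accuracy does not sit at maximum alignment.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_by_ratio(gen: List[Vector], task: List[Vector],
                       target: List[Vector],
                       ratios: Sequence[float]) -> Dict[float, float]:
    """Mean layer-wise cosine between the mixed and target gradients for each ratio.

    ratio = weight on general-domain data; each list holds one vector per layer.
    """
    scores: Dict[float, float] = {}
    for r in ratios:
        per_layer = []
        for g, t, ref in zip(gen, task, target):
            mixed = [r * gi + (1.0 - r) * ti for gi, ti in zip(g, t)]
            per_layer.append(cosine(mixed, ref))
        scores[r] = sum(per_layer) / len(per_layer)
    return scores


# Hypothetical 2-layer, 2-dimensional gradients; the target coincides with the task.
gen_grads = [[1.0, 0.0], [1.0, 0.0]]
task_grads = [[0.0, 1.0], [0.0, 1.0]]
target_grads = [[0.0, 1.0], [0.0, 1.0]]

scores = alignment_by_ratio(gen_grads, task_grads, target_grads, [0.0, 0.5, 1.0])
for ratio, score in scores.items():
    print(f"general-data ratio {ratio:.1f}: alignment {score:.3f}")
```

The grid needs no training, only gradient evaluations per candidate ratio, which is what makes a search of this shape cheap.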

Circularity Check

0 steps flagged

No significant circularity; empirical method and results are independent of inputs

The paper's chain consists of empirical observations (Perplexity Illusion via Kendall τ correlation, Alignment-Diversity Tradeoff via calibration experiments) followed by a proposed two-level TASA procedure whose outputs are validated directly against downstream task accuracies on LLaMA-3-8B and Qwen2.5-7B. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing premise rests on self-citation, and the gradient-trace alignment criterion is presented as a search heuristic whose effectiveness is measured externally rather than defined into the result. The central claim therefore remains falsifiable outside the paper's own search metric.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on empirical identification of two phenomena and a new search criterion plus aggregation rule whose effectiveness is asserted via experiments whose details are not provided in the abstract.

free parameters (2)

calibration data mixture ratio: determined by the training-free gradient-trace alignment criterion in TASA
inter/intra-layer bit allocation weights: derived from aggregated perplexity and reasoning-oriented sensitivity signals

axioms (2)

domain assumption: The gradient-trace alignment criterion reliably identifies beneficial calibration data mixtures for task performance. Invoked as the core search mechanism in the first level of TASA.
domain assumption: Combining perplexity and reasoning-oriented sensitivity signals produces superior bit allocations compared to single-signal baselines. Used to guide both inter-layer and intra-layer allocation in the second level of TASA.

invented entities (2)

Perplexity Illusion (no independent evidence)
purpose: Describes the observed near-zero rank correlation between perplexity sensitivity and reasoning-task influence
Newly identified phenomenon used to motivate the method

Alignment-Diversity Tradeoff (no independent evidence)
purpose: Describes the observed degradation from target-task-only calibration versus stabilization from general-domain data
Newly identified phenomenon used to motivate the method

[1] [1]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaif, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, and Jingtong Hu. GEMQ: Global expert-level mixed-precision quantization for MoE LLMs.arXiv preprint arXiv:2605.23078,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Layer-wise quantization: A pragmatic and effective method for quantizing LLMs beyond integer bit-levels.arXiv preprint arXiv:2406.17415,

Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, and Mihai Surdeanu. Layer-wise quantization: A pragmatic and effective method for quantizing LLMs beyond integer bit-levels.arXiv preprint arXiv:2406.17415,

work page arXiv

[6] [6]

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao

URLhttps://zenodo.org/records/12608602. Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. InInternational Conference on Machine Learning, pp. 2232– 2241,

work page arXiv

[7] [7]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, An- gela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, et al. The llama 3 herd of models.arXiv preprint arX...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

12 Preprint. Under review. Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz-Ziv, and Avi Mendelson. You had one job: Per-task quantization using LLMs’ hidden representations.arXiv preprint arXiv:2511.06516,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning.arXiv preprint arXiv:2501.03035,

Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning.arXiv preprint arXiv:2501.03035,

work page arXiv

[10] [10]

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, and Qingyi Gu. OSAQ: Outlier self-absorption for accurate low-bit LLM quantization.arXiv preprint arXiv:2605.04738,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

Ryan Lucas, Mehdi Makni, Xiang Meng, Adam Deng, and Rahul Mazumder. ADMM-Q: An im- proved hessian-based weight quantizer for post-training quantization of large language models. arXiv preprint arXiv:2605.11222,

work page internal anchor Pith review Pith/arXiv arXiv

[12] Miles Williams, George Chrysostomou, and Nikolaos Aletras. Self-calibration for language model quantization and pruning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 10149–10167, 2025.

[13] Yuzhuang Xu, Xu Han, Yuxuan Li, Pengzhan Li, and Wanxiang Che. Fitting is not enough: Smoothness in extremely quantized LLMs. arXiv preprint arXiv:2605.08894.

[14] Qwen2.5 Technical Report. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runj...

[15] Junchen Zhao, Ali Derakhshan, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, and Ian Harris. CoopQ: Cooperative game inspired layerwise mixed precision quantization for LLMs. arXiv preprint arXiv:2509.15455.

[16] Yanlong Zhao, Xiaoyuan Cheng, Huihang Liu, Baihua He, Xinyu Zhang, Harrison Bo Hua Zhu, Wenlong Chen, Li Zeng, and Zhuo Sun. Saliency-aware regularized quantization calibration for large language models. arXiv preprint arXiv:2605.05693.

[17] Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, and Kang Liu. From signal degradation to computation collapse: Uncovering the two failure modes of LLM quantization. arXiv preprint arXiv:2604.19884.

Table 4: Comparison of mixed-precision quantization approaches along three design dimensions. TASA jointly optimizes the sensitivity metric, calibration data, and allocation granularity in a task-aware manner.

Method | Sensitivity Metric | Calib Data | Granularity | Task-Aware?
HAWQ | Hessian eigenvalue | Generic | Layer | No
HAWQ-V2 | Hessian t...

Table 5: Baseline implementation details. All methods use group size $g = 128$ and WikiText-2 calibration with 128 samples.

Method | Implementation | Notes
RTN | Custom (symmetric MinMax) | Standard round-to-nearest with per-group scales...
SpQR | Official codebase Dettmers et al. (2024) | group size 16, bilevel 3-bit
OWQ | Official codebase Lee et al. (2024) | default configuration
SliM-LLM | Official codebase Huang et al. (2025) | integrated into our pipeline

while keeping all other layers at FP16, and measure the resulting degradation in three metrics. We deliberately use RTN rather than GPTQ for profiling: RTN applies a pure rounding perturbation without second-order error compensation, so the meas...
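As a rough illustration of what "symmetric MinMax round-to-nearest with per-group scales" means, here is a minimal pure-Python sketch. The function name `rtn_quantize` and its parameters are hypothetical and not taken from the paper's code; it operates on a flat list of weights rather than a weight matrix, and it returns dequantized values so that the difference from the input is exactly the rounding perturbation.

```python
def rtn_quantize(weights, bits=3, group_size=128):
    """Symmetric MinMax round-to-nearest with one scale per group.

    Illustrative sketch only: `weights` is a flat list of floats.
    Returns the dequantized weights.
    """
    qmax = 2 ** (bits - 1) - 1  # largest positive integer level, e.g. 3 for 3-bit
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            # Round to the nearest integer level, then clamp to the signed range.
            q = max(-qmax - 1, min(qmax, round(w / scale)))
            dequantized.append(q * scale)
    return dequantized


if __name__ == "__main__":
    original = [0.1, -0.7, 0.35, 1.0]
    restored = rtn_quantize(original, bits=3, group_size=4)
    errors = [abs(a - b) for a, b in zip(original, restored)]
    print(restored, max(errors))
```

Because no error compensation is applied, each weight's error is bounded by half of its group's scale, which is what makes RTN a clean probe for per-layer sensitivity.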

This procedure yields a sensitivity vector $s^{(k)} \in \mathbb{R}^{L}$ for each objective $k \in \{\text{ppl}, \text{math}, \text{arc}\}$, where each entry records the contribution of layer $l$ to objective $k$ under quantization perturbation.

Baseline implementations. Tab. 5 lists the implementation source for each baseline method. All baselines use the same group size ($g = 128$), calibration data (WikiText-2, $n =$ ...

The time complexity is $O(L \cdot B_{\text{total}} \cdot |\mathcal{B}|)$, which is negligible (< 1 second for all configurations). The resulting allocation produces a heterogeneous per-layer bit pattern that interleaves 3-bit, 4-bit, and occasionally 8-bit layers according to the multi-objective sensitivity landscape (see Tab. 15). For instance, at a 3.5-bit average budget on LLaMA-3-8B, ...
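The stated complexity $O(L \cdot B_{\text{total}} \cdot |\mathcal{B}|)$ matches a multiple-choice knapsack dynamic program over layers, total bit budget, and candidate bit-widths. The excerpt does not show the paper's exact allocator, so the following is only a sketch under assumptions: every layer has the same parameter count (so the average bit-width is the bit sum divided by $L$), and `cost[l][b]` is a hypothetical aggregated sensitivity score for layer `l` at bit-width `b`. The name `allocate_bits` is illustrative.

```python
def allocate_bits(cost, bit_choices, total_budget):
    """Pick one bit-width per layer, minimizing summed cost with sum(bits) <= total_budget.

    cost[l][b] is a hypothetical degradation score for layer l at bit-width b.
    Runs in O(L * total_budget * len(bit_choices)).
    """
    inf = float("inf")
    # best[j]: minimal cost of the layers processed so far using exactly j bits.
    best = [0.0] + [inf] * total_budget
    picks = []  # picks[l][j]: bit-width chosen for layer l on the best path to j
    for layer_cost in cost:
        new_best = [inf] * (total_budget + 1)
        pick = [None] * (total_budget + 1)
        for used in range(total_budget + 1):
            if best[used] == inf:
                continue
            for b in bit_choices:
                total = used + b
                if total <= total_budget and best[used] + layer_cost[b] < new_best[total]:
                    new_best[total] = best[used] + layer_cost[b]
                    pick[total] = b
        best = new_best
        picks.append(pick)
    end = min(range(total_budget + 1), key=lambda j: best[j])
    if best[end] == inf:
        raise ValueError("budget too small for the given bit choices")
    bits = []
    for pick in reversed(picks):  # backtrack from the best final state
        bits.append(pick[end])
        end -= pick[end]
    return bits[::-1]


if __name__ == "__main__":
    # Four equal-size layers, 3.5-bit average budget -> 14 bits in total.
    toy_cost = [
        {3: 5.0, 4: 1.0},
        {3: 1.0, 4: 0.9},
        {3: 4.0, 4: 0.5},
        {3: 0.2, 4: 0.1},
    ]
    print(allocate_bits(toy_cost, (3, 4), 14))
```

In the toy example the two layers with the largest cost reduction from 3-bit to 4-bit receive the extra bits, which gives the kind of interleaved pattern described above.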

for completeness.

B.5 CROSS-TASK OVERLAP ANALYSIS

A set-theoretic analysis of the top-$K$ most sensitive layers corroborates the rank correlation results discussed in the main text. On LLaMA-3-8B, the top-8 PPL-sensitive layers are {1, 3, 6, 16, 18, 20, 21, 31}. In contrast, the top-8 math-sensitive layers are {2, 10, 11, 12, 13, 14, 20, 31}, and the top-8 ARC-sensitive layer...
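The overlap between the two fully listed sets can be checked directly. The sketch below uses only the PPL and math layer indices quoted above (the ARC set is truncated in this excerpt and is therefore omitted); the helper name `top_k_overlap` is illustrative.

```python
def top_k_overlap(a, b):
    """Return the shared layers and the Jaccard similarity of two layer sets."""
    shared = a & b
    return shared, len(shared) / len(a | b)


ppl_top8 = {1, 3, 6, 16, 18, 20, 21, 31}
math_top8 = {2, 10, 11, 12, 13, 14, 20, 31}

shared, jaccard = top_k_overlap(ppl_top8, math_top8)
print(sorted(shared), round(jaccard, 3))  # [20, 31] 0.143
```

Only two of the eight layers (20 and 31) appear in both sets, so six of the eight math-sensitive layers would be missed by a purely perplexity-based ranking.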

for benchmarking. We note that the A100 does not support native INT4/INT3 compute; all quantized operations are performed with

Algorithm 1 TASA: Task-Aware Sensitivity Analysis
Input: Model $\theta$, layers $L$, bit candidates $\mathcal{B}$, budget $B$, general data $D_g$, task data $D_t$, candidate ratios $\mathcal{A}$, MOA weight $\beta$
Output: Mixed-precision quantized model $\hat{\theta}$...

takes a different approach to task-aware quantization. Rather than balancing alignment and diversity at the calibration level, TACQ uses backward-pass gradient attribution to identify “task circuits,” the 0.35% of weight parameters most critical for a specific target task, and preserves them at higher precision while uniformly quantizing the remainder t...

and its modern extension to LLM quantization (Frantar et al., 2023). Following standard assumptions in post-training quantization, the raw rounding error $\epsilon$ for each weight column is modeled as uncorrelated zero-mean noise with variance $\sigma_q^2$ determined by the bit-width. Under OBS, however, quantizing each weight triggers a compensating update to the remain...

exhibit reversed behavior due to their high sensitivity to token distributions.

where $H_\alpha = \alpha H_{\text{gen}} + (1 - \alpha) H_{\text{task}}$ is the calibration Hessian under mixing ratio $\alpha$, and $H_{\text{test}} = X_{\text{test}}^{\top} X_{\text{test}} / n$ is the Hessian on the test distribution. This trace formula admits a spectral decomposition that reveals a classical bias-variance dilemma. Let $\{(\lambda_i, u_i)\}$ be the eigen-pairs ...
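To make the tradeoff concrete, the quantity $\operatorname{tr}(H_{\text{test}} H_\alpha^{-1})$ with $H_\alpha = \alpha H_{\text{gen}} + (1 - \alpha) H_{\text{task}}$ can be evaluated on a toy problem. The $2 \times 2$ matrices below are invented for illustration and are not from the paper: the task Hessian is nearly rank-deficient in one direction that the test distribution still uses, while the general Hessian is isotropic. This illustrates the trace expression only, not the paper's training-free selection procedure.

```python
def mix(h_gen, h_task, alpha):
    """H_alpha = alpha * H_gen + (1 - alpha) * H_task, element-wise."""
    return [[alpha * g + (1 - alpha) * t for g, t in zip(row_g, row_t)]
            for row_g, row_t in zip(h_gen, h_task)]


def inv2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def trace_objective(h_test, h_gen, h_task, alpha):
    """f(alpha) = tr(H_test @ inverse(H_alpha)) for 2x2 matrices."""
    h_inv = inv2(mix(h_gen, h_task, alpha))
    return sum(h_test[i][j] * h_inv[j][i] for i in range(2) for j in range(2))


# Toy (hypothetical) Hessians.
H_TASK = [[4.0, 0.0], [0.0, 0.01]]  # strong in one direction, nearly blind in the other
H_GEN = [[1.0, 0.0], [0.0, 1.0]]    # diverse but not task-aligned
H_TEST = [[4.0, 0.0], [0.0, 0.5]]   # mostly task-like, with some off-task energy

alphas = [i / 10 for i in range(11)]
values = {a: trace_objective(H_TEST, H_GEN, H_TASK, a) for a in alphas}
best_alpha = min(values, key=values.get)
print(best_alpha, round(values[best_alpha], 3))
```

On this toy problem the objective is large at $\alpha = 0$ (task data only), moderate at $\alpha = 1$ (general data only), and smallest at an interior mixture, which is the qualitative shape the analysis describes.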

and matrix analysis (Bhatia, 2007). Since $H_\alpha$ is a positive affine function of $\alpha$ and is strictly positive definite for all $\alpha \in (0, 1]$ (guaranteed by Item (i)), the composite function $f(\alpha) = \operatorname{tr}(H_{\text{test}} H_\alpha^{-1})$ is strictly convex on $(0, 1]$.

Step 2: Boundary behavior as $\alpha \to 0$. As $\alpha \to 0$, we have $H_\alpha \to H_{\text{task}}$. Consider the spectral contribution of the trailing eigenvector $v$ from Ite...