arxiv: 2606.19558 · v1 · pith:LGXYP6KHnew · submitted 2026-06-17 · 💻 cs.LG · cs.CL

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Pith reviewed 2026-06-26 20:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords KL divergence · model quantization · LLM fidelity metrics · benchmark correlation · quantized deployment · silent zone · disagreement volume · direction of error

The pith

KL divergence correlates with LLM benchmark scores overall, but the correlation is no longer statistically significant near the full-precision baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common fidelity metrics such as per-token KL divergence can serve as cheap proxies for actual downstream benchmark performance when deploying quantized large language models. Across two large quantization cohorts it finds strong negative correlations with benchmark scores when including heavily quantized models, yet this correlation drops to statistical insignificance once the analysis is restricted to models whose outputs remain close to the original high-precision reference. The authors attribute the collapse to a structural property: KL divergence mainly registers the total volume of probability shifts away from the reference rather than whether those shifts move outputs toward or away from correct answers on specific tasks. The pattern is stable across fourteen different ways of computing or aggregating the metric.

Core claim

KLD is strongly correlated with benchmark score over the full cohort (ρ=-0.72 on Qwen and ρ=-0.86 on Devstral, both with p<0.001). However, this relationship collapses to non-significance in the near-baseline silent zone (ρ=+0.00 on Qwen and ρ=-0.24, p=0.36, on Devstral). KLD primarily measures the volume of disagreement with the reference, with silent-zone composite ρ=+0.94 (p<0.001) on Qwen and +0.55 (p=0.03) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.

What carries the argument

The volume-versus-direction decomposition of per-token KL divergence, separating the extent of token-level probability changes from whether those changes improve or degrade task accuracy.
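
To make the volume side of this decomposition concrete, here is a minimal Python sketch of per-token KLD computed from logits, alongside a top-1 disagreement rate. It is an illustration under stated assumptions, not the paper's code: the function names, the toy logits, the use of nats, and the mean aggregation are all hypothetical choices. Neither quantity consults a gold answer, which is the sense in which such metrics register displacement from the reference rather than its direction.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kld(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats at a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def sequence_metrics(ref_seq, quant_seq):
    """Mean per-token KLD and top-1 disagreement rate over aligned positions."""
    klds = [token_kld(r, q) for r, q in zip(ref_seq, quant_seq)]
    flips = [argmax(r) != argmax(q) for r, q in zip(ref_seq, quant_seq)]
    return sum(klds) / len(klds), sum(flips) / len(flips)

# Toy logits (hypothetical): 3 positions, vocabulary of 3 tokens.
ref_seq = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.1, 0.0]]
quant_seq = [[2.0, 1.0, 0.0], [0.1, 2.9, 0.0], [1.1, 1.0, 0.0]]

mean_kld, disagreement_rate = sequence_metrics(ref_seq, quant_seq)
print(f"mean KLD = {mean_kld:.5f} nats, top-1 disagreement = {disagreement_rate:.2f}")
```

In the toy data only the last position flips its top-1 token, and nothing in either function says whether that flip moved the output toward or away from a correct answer.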

If this is right

  • KLD cannot reliably rank or select among near-lossless quantizations.
  • Per-prompt KLD shows only weak power to flag individual prompt failures on code tasks.
  • Using KLD to route between models on disagreement prompts yields accuracy only slightly above random.
  • The loss of predictive power holds for multiple KLD aggregations, perplexity variants, and calibration settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deployment decisions in the high-fidelity regime may need metrics that explicitly track error direction rather than total displacement.
  • Direct benchmark sweeps remain necessary when candidate models differ only in small distributional ways.
  • Task-specific or direction-aware fidelity measures could be developed by weighting disagreements according to their effect on labeled examples.

Load-bearing premise

The chosen downstream benchmarks and the definition of the near-baseline silent zone supply an unbiased ground truth for whether small distributional shifts affect task performance.

What would settle it

Re-running the correlation analysis on a fresh set of near-baseline quantizations or models and obtaining a statistically significant negative correlation between KLD and benchmark scores.

Figures

Figures reproduced from arXiv: 2606.19558 by Ali Hadi Zadeh, Andreas Moshovos, Enrique Torres Sanchez, Miloš Nikolić.

Figure 1: The silent-zone collapse on both cohorts, with Spearman correlation coefficients

Figure 2: Per-prompt KLD distribution on LiveCodeBench, split by each model's own pass (green) / fail (red) outcomes, for five silent-zone Qwen quants.
  Excerpt: "… a dataset-aggregate scalar, we feed the model exact benchmark prompts, calculate KLD on the responses and compare to the benchmark pass/fail outcomes. If KLD had any predictive power in the silent-zone, we should be able to identify which prompts a given quant wil…"

Figure 3: Silent-zone Spearman ρ(KLD, ·) per benchmark for volume (blue) and direction f (orange).
  Excerpt: "… they remain effective coarse degradation detectors. The failure mode is narrower: among near-baseline quants, where practitioners choose between plausible candidates, we do not observe a reliable ranking signal. The silent-zone thresholds are descriptive and cohort-specific: they depend on the reference model, quanti…"
Original abstract

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($\rho=-0.72$ on Qwen and $\rho=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($\rho=+0.00$ on Qwen and $\rho=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $\rho=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
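
The failed-vs-passed geometric-mean ratio quoted in the abstract can be sketched as follows. This is one plausible reading, assumed here rather than taken from the paper: the geometric mean of per-prompt KLD over failed prompts divided by the same quantity over passed prompts. The function names and the four data points are hypothetical; a ratio near 1 would mean KLD barely separates failures from passes.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))

def failed_vs_passed_ratio(per_prompt_kld, passed):
    """Ratio of geometric-mean KLD on failed prompts to that on passed prompts."""
    failed_vals = [k for k, ok in zip(per_prompt_kld, passed) if not ok]
    passed_vals = [k for k, ok in zip(per_prompt_kld, passed) if ok]
    return geometric_mean(failed_vals) / geometric_mean(passed_vals)

# Hypothetical per-prompt KLD values and pass/fail outcomes for one quant.
kld = [0.01, 0.04, 0.02, 0.08]
passed = [True, True, False, False]

ratio = failed_vs_passed_ratio(kld, passed)
print(f"failed/passed geometric-mean ratio = {ratio:.2f}")
```

The toy values are arbitrary and chosen only so the arithmetic is easy to follow; they are not meant to resemble the reported range of [1.08, 1.22].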

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates fidelity metrics such as per-token KL divergence (KLD) as proxies for downstream benchmark performance in quantized LLMs. Using cohorts of 28 and 41 quantized variants of Qwen3.6-35B-A3B and Devstral-Small-2-24B respectively, it reports strong negative correlations between KLD and benchmark scores over the full cohorts (ρ = -0.72 on Qwen and ρ = -0.86 on Devstral, both p < 0.001), but finds that these correlations collapse to non-significance inside a 'near-baseline silent zone' (ρ = +0.00 on Qwen; ρ = -0.24, p = 0.36, on Devstral). The collapse is claimed to persist across 14 measurement variants (different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, context lengths). Additional results show weak per-prompt failure prediction on LiveCodeBench and poor cross-model routing accuracy (42.3–49.4%). The authors decompose KLD into a volume-of-disagreement component (strongly correlated with KLD in the silent zone) versus a direction component (weak and task-conditional).

Significance. If the central empirical observation holds after addressing definitional issues, the work would be significant for quantized LLM deployment practice: it supplies concrete evidence that common low-cost proxies like KLD lose predictive value precisely in the low-error regime that matters most for production use. The volume-versus-direction decomposition offers a reusable conceptual lens. The study is strengthened by its use of two distinct model families, explicit correlation coefficients with p-values, and explicit testing across multiple measurement variants; these elements make the reported attenuation falsifiable and reproducible in principle.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (silent-zone construction): the 'near-baseline silent zone' is introduced without a pre-specified threshold, variance cutoff, or hold-out procedure for its boundaries. Because the full-cohort ρ is driven by the tail of severely degraded quantizations, restricting analysis to the low-variance interior can mechanically attenuate the correlation even if KLD continues to track residual differences; the claim that the collapse 'persists across 14 measurement variants' does not mitigate this, as all variants share the same post-selected subset.
  2. [Abstract] Abstract: the per-prompt failure-prediction results (geometric-mean ratios [1.08,1.22] on LiveCodeBench) and routing accuracy (42.3–49.4%) are presented as supporting evidence that KLD fails as a fine-grained signal, yet no ablation is shown that isolates whether these weak numbers arise from the same volume/direction decomposition or from benchmark-specific noise; without that link the per-prompt findings remain only loosely connected to the headline correlation-collapse claim.
minor comments (2)
  1. Table or figure captions should explicitly state the exact cohort sizes (28 and 41) and the precise benchmark suite used for the reported ρ values.
  2. Notation for the 14 variants is summarized in the abstract but would benefit from a single consolidated table listing each variant, its aggregation method, and the resulting silent-zone ρ.
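
The range-restriction concern in major comment 1 can be illustrated with a small simulation. The sketch below is a synthetic example, not a reanalysis of the paper's cohorts: the data-generating process, the seed, the cutoff of 0.1, and all function names are assumptions. By construction the score depends on KLD with the same slope and the same noise everywhere, yet Spearman's ρ shrinks once the sample is restricted to the low-KLD interior.

```python
import math
import random

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

random.seed(0)
# Synthetic cohort (hypothetical): score falls linearly with KLD plus fixed noise,
# so KLD tracks score equally well at every KLD level by construction.
kld = [random.uniform(0.0, 1.0) for _ in range(2000)]
score = [-k + random.gauss(0.0, 0.1) for k in kld]

rho_full = spearman(kld, score)

# Post-hoc restriction to the low-KLD interior, analogous to a near-baseline zone.
subset = [(k, s) for k, s in zip(kld, score) if k < 0.1]
rho_restricted = spearman([k for k, _ in subset], [s for _, s in subset])

print(f"full cohort rho = {rho_full:.2f}, restricted rho = {rho_restricted:.2f}")
```

This does not show that the paper's collapse is an artifact; it only shows that attenuation alone cannot distinguish "KLD stops tracking quality" from "the restricted range leaves little variance to rank", which is why a pre-specified zone and a sensitivity analysis matter.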

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the silent-zone construction and the linkage of per-prompt results. We address each point below with plans for revision where the concerns are valid.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (silent-zone construction): the 'near-baseline silent zone' is introduced without a pre-specified threshold, variance cutoff, or hold-out procedure for its boundaries. Because the full-cohort ρ is driven by the tail of severely degraded quantizations, restricting analysis to the low-variance interior can mechanically attenuate the correlation even if KLD continues to track residual differences; the claim that the collapse 'persists across 14 measurement variants' does not mitigate this, as all variants share the same post-selected subset.

    Authors: We agree that the zone definition was post-hoc and could introduce selection bias. In revision we will pre-specify the silent zone via a fixed criterion (benchmark degradation ≤2% relative to FP16 baseline) and add sensitivity analyses using alternative thresholds plus a hold-out procedure for zone boundaries. This will test whether the correlation collapse holds independently of the shared subset. revision: yes

  2. Referee: [Abstract] Abstract: the per-prompt failure-prediction results (geometric-mean ratios [1.08,1.22] on LiveCodeBench) and routing accuracy (42.3–49.4%) are presented as supporting evidence that KLD fails as a fine-grained signal, yet no ablation is shown that isolates whether these weak numbers arise from the same volume/direction decomposition or from benchmark-specific noise; without that link the per-prompt findings remain only loosely connected to the headline correlation-collapse claim.

    Authors: We acknowledge the connection is not explicit. The per-prompt results illustrate KLD's practical shortcomings, but we did not ablate their relation to the volume/direction split. In revision we will add a discussion paragraph noting that the volume component (high correlation with the zone composite) likely explains the weak per-prompt signal, while direction remains task-conditional; we will also flag benchmark noise as a possible confounder. revision: partial
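
The threshold-based zone definition and sensitivity analysis proposed in the first response could look roughly like the sketch below. Only the 2% relative-degradation criterion comes from the rebuttal; the other thresholds, the baseline score, the seven (KLD, score) pairs, and the function names are hypothetical, and the hold-out procedure is not shown. The rank formula used here is valid only when there are no ties.

```python
def ranks_no_ties(values):
    """1-based ranks for a sequence with no repeated values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_no_ties(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_no_ties(x), ranks_no_ties(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def silent_zone(cohort, baseline_score, max_rel_degradation):
    """Keep quants whose score drop relative to the baseline is within the limit."""
    return [
        (k, s) for k, s in cohort
        if (baseline_score - s) / baseline_score <= max_rel_degradation
    ]

# Hypothetical cohort of (mean KLD, benchmark score) pairs and baseline score.
BASELINE = 80.0
COHORT = [
    (0.001, 79.6), (0.003, 80.2), (0.006, 79.9), (0.012, 79.0),
    (0.020, 79.8), (0.060, 77.0), (0.200, 70.0),
]

# Sweep the zone threshold; 1.00 keeps the full cohort for comparison.
rho_by_threshold = {}
for threshold in (0.01, 0.02, 0.05, 1.00):
    zone = silent_zone(COHORT, BASELINE, threshold)
    klds = [k for k, _ in zone]
    scores = [s for _, s in zone]
    rho_by_threshold[threshold] = spearman_no_ties(klds, scores)
    print(f"threshold {threshold:.2f}: n = {len(zone)}, rho = {rho_by_threshold[threshold]:+.2f}")
```

Fixing the threshold before looking at the correlations, and reporting ρ at several neighbouring thresholds, lets a reader see whether a collapse is specific to one boundary choice.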

Circularity Check

0 steps flagged

No circularity; direct empirical measurement study

Full rationale

The paper reports empirical correlations (ρ values) between KLD and benchmark scores on fixed quantization cohorts, with a post-hoc partition into a 'near-baseline silent zone' based on proximity to the unquantized reference. No equations, derivations, fitted parameters, or self-citations are present that reduce any reported result to an input by construction. The correlation collapse is a measured outcome on the partitioned data, not a self-definitional or fitted-input prediction. This matches the default case of a self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard statistical correlation measures and existing downstream benchmarks without introducing new free parameters, axioms beyond basic statistical assumptions, or invented entities.

axioms (2)
  • standard math: Spearman rank correlation and associated p-values are appropriate for relating KLD values to benchmark scores under the observed data distributions.
    Used to report ρ and p-values throughout the abstract.
  • domain assumption: The selected downstream benchmarks serve as valid ground-truth measures of model quality for the purpose of evaluating fidelity metrics.
    Implicit when treating benchmark score as the reference against which KLD is judged.


