pith. machine review for the scientific record.

arxiv: 2602.10144 · v2 · submitted 2026-02-09 · 📊 stat.ML · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

When LLMs get significantly worse: A statistical approach to detect model degradations

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:00 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords LLM optimization · model degradation · McNemar's test · hypothesis testing · statistical evaluation · inference optimization · accuracy comparison

The pith

A hypothesis test on paired per-sample outputs detects real LLM degradations after optimization while controlling false positives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a McNemar's test framework that compares original and optimized model outputs sample by sample rather than on aggregate accuracy. This matters because even theoretically lossless optimizations can produce different generations due to numerical noise, making small accuracy changes hard to interpret without statistics. By focusing on disagreements where one model succeeds and the other fails, the test distinguishes actual quality loss from harmless evaluation variance. The method also supplies three ways to combine results across benchmarks and includes an implementation that correctly identifies degradations as small as 0.3 percent in case studies while passing provably lossless changes.

Core claim

The central claim is that comparing the two models on each individual sample with McNemar's test yields a statistically valid procedure for declaring degradation: the test directly counts the discordant pairs where the optimized model errs and the original succeeds (or vice versa) and asks whether that imbalance exceeds what random noise would produce, while the aggregation rules across tasks preserve the overall false-positive guarantee.

What carries the argument

McNemar's test on paired per-sample correctness indicators, which counts cases of disagreement between the two models and tests whether one model is systematically better.
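As a concrete illustration of that machinery, here is a minimal sketch of an exact one-sided McNemar test on paired correctness indicators. The function name and data are illustrative, not the paper's implementation.

```python
# Minimal sketch of the exact one-sided McNemar test on paired per-sample
# correctness indicators. Names and data are illustrative, not the paper's code.
from scipy.stats import binomtest

def mcnemar_pvalue(orig_correct, opt_correct):
    """P-value for 'the optimized model degrades', from discordant pairs."""
    # b: original correct, optimized wrong (evidence of degradation)
    # c: original wrong, optimized correct
    b = sum(1 for x, y in zip(orig_correct, opt_correct) if x and not y)
    c = sum(1 for x, y in zip(orig_correct, opt_correct) if not x and y)
    if b + c == 0:
        return 1.0  # no disagreements at all: no evidence either way
    # Under H0 (no degradation) each discordant pair lands on b with prob 1/2.
    return binomtest(b, b + c, 0.5, alternative="greater").pvalue

orig = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
opt  = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
p_value = mcnemar_pvalue(orig, opt)  # b = 3, c = 0 -> p = 0.125
```

Note that only the discordant pairs enter the statistic; samples where both models agree carry no information about relative quality.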

If this is right

  • Accuracy drops of 0.3 percent can be declared statistically significant degradations rather than noise.
  • The procedure controls the overall false-positive rate when decisions are made across multiple benchmarks.
  • Three aggregation methods allow a single accept/reject decision while preserving the per-test guarantees.
  • The test distinguishes harmless numerical variation from genuine quality loss in temperature-zero evaluations.
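One standard way to fold per-task p-values into a single decision is Fisher's method, which the paper's appendix lists as an aggregation algorithm (Algorithm 3). The sketch below is an illustration of that combining rule, not the paper's exact procedure:

```python
# Sketch of Fisher's method for combining per-task McNemar p-values into one
# decision. Illustrative of the paper's Fisher aggregation test, not its code.
import math
from scipy.stats import chi2

def fisher_aggregate(pvalues):
    """Combine independent per-task p-values; a small result flags degradation."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # Under H0, stat follows a chi-squared law with 2T degrees of freedom.
    return chi2.sf(stat, df=2 * len(pvalues))

combined = fisher_aggregate([0.20, 0.04, 0.30])  # roughly 0.06
```

Because the combined p-value is compared once against α, the per-test false-positive guarantee carries over to the single accept/reject decision (assuming independent tasks).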

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paired-sample logic could be applied to detect degradation after fine-tuning or continued pre-training rather than only inference optimizations.
  • Adoption would likely make practitioners more cautious about aggressive quantization or pruning once small drops become detectable.
  • The framework assumes evaluation data remain stationary; any drift in the test set itself would require additional corrections not addressed here.

Load-bearing premise

That observed differences between the paired outputs of the original and optimized models arise only from the optimization itself and satisfy the exchangeability conditions required by McNemar's test.

What would settle it

Run the test on a model pair known to be identical except for a provably lossless transformation that introduces no accuracy change; if the procedure rejects the null at the target significance level, the framework fails to control false positives.
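That check can be simulated directly: pair two "identical" models whose outputs differ only by symmetric random flips (numerical noise on top of a lossless change) and verify the rejection rate stays at or below the target level. This is a hypothetical setup under an independence assumption, not the paper's case study:

```python
# Simulating the settling experiment: two models that differ only by symmetric
# random flips, so the null (no degradation) holds by construction. Assumes
# independent samples; the setup is hypothetical, not the paper's case study.
import random
from scipy.stats import binomtest

def mcnemar_rejects(b, c, alpha=0.05):
    """Exact one-sided McNemar decision on discordant counts b (degrade), c."""
    if b + c == 0:
        return False
    return binomtest(b, b + c, 0.5, alternative="greater").pvalue <= alpha

rng = random.Random(0)
trials, N, flip = 1000, 1000, 0.02
rejections = 0
for _ in range(trials):
    # Each sample flips correctness with equal probability in each direction.
    b = sum(1 for _ in range(N) if rng.random() < flip / 2)
    c = sum(1 for _ in range(N) if rng.random() < flip / 2)
    rejections += mcnemar_rejects(b, c)
false_positive_rate = rejections / trials  # should stay at or below ~0.05
```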

Figures

Figures reproduced from arXiv: 2602.10144 by Ashish Khetan, George Karypis, Jonas Kübler, Junming Yin, Kailash Budhathoki, Matthäus Kleindessner, Xiong Zhou.

Figure 1
Figure 1: Detecting accuracy degradation on Llama-3.1 8B Instruct based on empirical estimates [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2: Success histogram for MMLU-Pro [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3: Rejection rates of the proposed aggregation schemes dependent on the number of tasks [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
Figure 4
Figure 4: Asymptotic test power for N = 25,282 and α = 0.05 as a function of flip probability p↕ and degradation probability q↓. We additionally show the rejection rates of an aggregated test that rejects if any of the three tests rejects at α = 5%. This test does not control the significance level correctly, but it captures all potential scenarios. [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5: p-values of the pooled test as a function of sample size N, empirical accuracy degradation δ̂, and different flip probabilities. Observations above the dotted line are rejected as significant degradations at α = 5%. [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
read the original abstract

Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is an evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test allowing to efficiently detect model degradations, while guaranteeing a controlled rate of false positives. The crucial insight is that we have to confront the model scores on each sample, rather than aggregated on the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the largely adopted open source LM Evaluation Harness and provide a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a statistical hypothesis testing framework using McNemar's test on paired per-sample correct/incorrect labels to determine whether observed accuracy differences between an original LLM and an optimized version indicate true degradation or can be attributed to evaluation noise. The approach aims to control the false positive rate, proposes aggregation methods across benchmarks, provides an open-source implementation in the LM Evaluation Harness, and includes a case study demonstrating its ability to detect small degradations (0.3%) while sparing provably lossless optimizations.

Significance. This work addresses an important practical problem in LLM optimization research by offering a rigorous statistical method to validate model quality post-optimization. If the assumptions of McNemar's test are satisfied in the benchmark setting, the framework could become a standard tool for distinguishing signal from noise in accuracy evaluations. The provision of code in the LM Evaluation Harness and the empirical case study are notable strengths that enhance reproducibility and applicability.

major comments (2)
  1. [§3] §3 (McNemar test application): The claim of a controlled false-positive rate relies on the asymptotic chi-squared distribution of the McNemar statistic ((b-c)^2/(b+c)), which requires independent pairs. LLM benchmarks routinely contain correlated samples (multiple questions from the same passage or similar prompts), violating independence and potentially inflating type I error above the nominal alpha. This directly undermines the central guarantee and is not addressed by the case study on lossless models.
  2. [§4] §4 (aggregation across benchmarks): The three proposed aggregation approaches are not shown to preserve the type I error control when combining p-values or decisions from dependent benchmarks; without explicit multiple-testing correction or simulation under realistic dependence, the single-decision claim lacks support.
minor comments (2)
  1. [Abstract] Abstract: The three aggregation approaches are mentioned but not named or briefly described, reducing immediate clarity.
  2. [Implementation] Implementation section: No usage example or pseudocode is provided despite the open-source claim; a short snippet would aid adoption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with specific plans for revision to strengthen the statistical foundations and empirical validation of the proposed framework.

read point-by-point responses
  1. Referee: [§3] §3 (McNemar test application): The claim of a controlled false-positive rate relies on the asymptotic chi-squared distribution of the McNemar statistic ((b-c)^2/(b+c)), which requires independent pairs. LLM benchmarks routinely contain correlated samples (multiple questions from the same passage or similar prompts), violating independence and potentially inflating type I error above the nominal alpha. This directly undermines the central guarantee and is not addressed by the case study on lossless models.

    Authors: We appreciate the referee's identification of this key assumption. The standard McNemar test and its asymptotic chi-squared approximation do require independent paired observations, and LLM benchmarks commonly feature dependence arising from shared passages, similar prompts, or related questions. Our case study illustrates behavior on provably lossless optimizations but does not quantify type I error inflation under realistic dependence. In the revised manuscript we will expand §3 with an explicit discussion of this limitation. We will also add Monte Carlo simulation experiments that inject controlled dependence structures (e.g., by correlating per-sample error indicators across related items) to measure the realized false-positive rate. If the simulations reveal material inflation, we will report conservative alpha adjustments or alternative procedures; otherwise we will document the observed robustness. revision: yes
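The kind of simulation the response describes could look like this: under the null (no degradation), share one flip "coin" across a whole cluster of related items and watch the realized type I error of the naive per-item test. The setup is hypothetical, not the authors' planned experiments:

```python
# Hypothetical Monte Carlo of the dependence concern: flips shared within
# clusters of related benchmark items inflate the naive per-item McNemar
# test's type I error. Illustrative only, not the authors' planned study.
import random
from scipy.stats import binomtest

def mcnemar_rejects(b, c, alpha=0.05):
    if b + c == 0:
        return False
    return binomtest(b, b + c, 0.5, alternative="greater").pvalue <= alpha

def type1_rate(cluster_size, n_clusters=100, flip=0.04, trials=500, seed=1):
    """Rejection rate under the null when items flip in whole clusters."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        b = c = 0
        for _ in range(n_clusters):
            # One shared coin per cluster and direction: items flip together.
            if rng.random() < flip / 2:
                b += cluster_size
            if rng.random() < flip / 2:
                c += cluster_size
        rejections += mcnemar_rejects(b, c)
    return rejections / trials

independent = type1_rate(cluster_size=1)   # near or below alpha
clustered = type1_rate(cluster_size=10)    # well above alpha
```

With independent items the exact test is conservative; with clustered flips the same test treats each item as independent evidence and rejects far more often than α, which is exactly the inflation the referee warns about.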

  2. Referee: [§4] §4 (aggregation across benchmarks): The three proposed aggregation approaches are not shown to preserve the type I error control when combining p-values or decisions from dependent benchmarks; without explicit multiple-testing correction or simulation under realistic dependence, the single-decision claim lacks support.

    Authors: We concur that the aggregation methods require explicit validation under inter-benchmark dependence to support the single-decision guarantee. The current manuscript presents the three aggregation strategies without such checks. We will revise §4 to include simulation studies that generate dependent p-values or binary decisions across benchmarks (modeling realistic correlation induced by shared model behavior). These experiments will verify whether the combined type I error stays at or below the nominal level. Where appropriate, we will incorporate standard multiple-testing corrections (e.g., Bonferroni or FDR control) into the aggregation rules and report the resulting operating characteristics. revision: yes
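A Bonferroni-corrected "reject if any task rejects" rule, one of the standard corrections the response mentions, is valid under arbitrary dependence between benchmarks and takes only a few lines. This is an illustrative sketch, not one of the paper's three aggregation methods:

```python
# Sketch of a Bonferroni-corrected 'reject if any task rejects' rule, which
# controls the family-wise false-positive rate under arbitrary dependence
# between benchmarks. Illustrative; not the paper's named aggregation methods.
def bonferroni_any_reject(pvalues, alpha=0.05):
    """Flag degradation if any per-task p-value clears alpha / T."""
    threshold = alpha / len(pvalues)
    return any(p <= threshold for p in pvalues)

flagged = bonferroni_any_reject([0.30, 0.012, 0.20])  # 0.012 <= 0.05/3
```

The price of dependence-free validity is power: each task must clear the stricter threshold α/T rather than α.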

Circularity Check

0 steps flagged

No significant circularity; standard McNemar application to paired samples

full rationale

The paper proposes applying the established McNemar's test directly to paired per-sample correct/incorrect outcomes between two models, without deriving new statistics, fitting parameters that are then called predictions, or relying on self-citations for the core validity claim. The hypothesis-testing guarantee follows from the standard asymptotic chi-squared property of the McNemar statistic under its usual assumptions, which are external to the paper. No equations reduce the proposed decision rule to the input data by construction, and the aggregation methods across benchmarks are presented as straightforward combinations rather than self-referential derivations. The work is therefore self-contained as an application of existing statistical tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on the standard assumptions of McNemar's test for paired binary outcomes and on the premise that benchmark samples are sufficiently representative.

axioms (1)
  • domain assumption McNemar's test assumptions hold for paired LLM outputs on identical prompts
    The test requires that the paired observations are independent across samples and that the only systematic difference is the model optimization.

pith-pipeline@v0.9.0 · 5559 in / 1135 out tokens · 53763 ms · 2026-05-16T06:00:36.116161+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 12 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925,

  2. [2]

    Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv:2302.01318,

  3. [3]

    The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv:2407.21783,

  4. [4]

Optimal Kernel Choice for Large-Scale Two-Sample Tests

    Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. Advances in Neural Information Processing Systems, 25,

  5. [5]

A Proximal Operator for Inducing 2:4-Sparsity

    Jonas M. Kübler, Yu-Xiang Wang, Shoham Sabach, Navid Ansari, Matthäus Kleindessner, Kailash Budhathoki, Volkan Cevher, and George Karypis. A proximal operator for inducing 2:4-sparsity. Transactions on Machine Learning Research,

  6. [6]

    Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dye, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv:2206.14858,

  7. [7]

    DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv:2412.19437,

  8. [8]

Alice Meshbane and John D. Morris. Predictive discriminant analysis versus logistic regression in two-group classification problems. Annual Meeting of the American Educational Research Association (New York, NY, April 8-12, 1996).

  9. [9]

    Adding error bars to evals: A statistical approach to language model evaluations

    Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv:2411.00640,

  10. [10]

    Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv:1602.06023,

  11. [11]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv:2311.12022,

  12. [12]

MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning

    Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. arXiv:2310.16049,

  13. [13]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv:2210.09261,

  14. [14]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv:2406.01574,

  15. [15]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-a...

  16. [16]

    Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388, 2025a. Jing Yang, Ruibo Wang, Yijun Song, and Jihong Li. Block-regularized 5×2 cross-validated McNemar's test for comparing two classification al...

  17. [17]

    Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv:2311.07911,

  18. [18]

    A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv:2404.14294,

  19. [19]

Appendix A (Notation and Algorithms): In Table 3 we summarize the notation that we use throughout the paper, and Algorithms 1 to 4 describe our (aggregation) algorithms. Table 3, Notation Summary — M, M̃: baseline and optimized models; N: total sample size; T: number of tasks; a, b, c, d: contingency tabl...

  20. [20]

Algorithm 3: Fisher Aggregation Test

    Denoting the population probabilities with P_a, P_b, P_c, P_d, we have E[D(X)] = P_b − P_c (14) and E[D²(X)] = P_b + P_c (15). Algorithm 3 (Fisher Aggregation Test) requires lists b_list, c_list for T tasks and a significance level α: for i = 1 to T, compute p_i ← McNemar(b = b_list[i], c = c_list[i], α); then χ²_stat ← −2 Σ_{i=1}^{T} ln(p_i); return p...

  21. [21]

    We refer to the HuggingFace model cards for full information about their configurations

Table 4: Full model checkpoint specifiers used in the experiments. We refer to the HuggingFace model cards for full information about their configurations. ID / HuggingFace model repository: Llama-3.1 8B — meta-llama/Llama-3.1-8B-Instruct; w4a16 — RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 ...

  22. [22]

Algorithm 5 (Permutation Pooled Test): Require score lists L_M = [L̂_M(x_1),

    standardizes per-task differences by their standard errors and uses the maximum standardized difference as the test statistic, analogous... (Footnote URLs: https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-summary and https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals)

  23. [23]

    We therefore compare the 20B model against a rerun and against a version with FP8 KV cache

Since the GPT-OSS models already come with their MoE modules in MX-FP4 precision by default, we could not find meaningful models that are further quantized. We therefore compare the 20B model against a rerun and against a version with FP8 KV cache. We also include a pruned variant which only has 7 experts per MoE layer gpt-oss-6.0b-specialized-all-prune...