A McNemar-based statistical test detects real degradations in optimized LLMs with controlled false positives, even for accuracy changes as small as 0.3%.
Algorithm 5Permutation Pooled Test Require:Score listsL M = [ ˆLM(x1)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
stat.ML 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
When LLMs get significantly worse: A statistical approach to detect model degradations
A McNemar-based statistical test detects real degradations in optimized LLMs with controlled false positives, even for accuracy changes as small as 0.3%.