Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
Pith reviewed 2026-05-07 07:58 UTC · model grok-4.3
The pith
LLM accuracy gains on benchmarks are the net result of opposing improvements and deteriorations at the individual question level.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The aggregate accuracy gain between LLM versions is the net residual of opposing item-level movements. Using the Reliable Change Index on MMLU-Pro, 79% of items for Llama 3 to 3.1 and 72% for Qwen 2.5 to 3 showed no reliable change, and over half of all items sat at floor or ceiling. Among analysable items, 34% improved and 28% deteriorated for Llama, while 47% improved and 39% deteriorated for Qwen, with median absolute probability shifts of 0.50 and 0.90 respectively.
What carries the argument
The Reliable Change Index (RCI) applied to per-item response probabilities across K=10 samples at temperature 0.7, which classifies each question as reliably improved, reliably deteriorated, or unchanged while accounting for sampling variability.
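A minimal sketch of how such a per-item classification might look. The function name `classify_item`, the unpooled binomial denominator, and the handling of zero-variance items are all assumptions made for illustration; the source does not specify the paper's exact computation.

```python
import math

Z_CRIT = 1.96  # two-sided 5% cutoff used by the standard RCI


def classify_item(samples_old, samples_new, z_crit=Z_CRIT):
    """Classify one item from two lists of 0/1 correctness samples."""
    k_old, k_new = len(samples_old), len(samples_new)
    p_old = sum(samples_old) / k_old
    p_new = sum(samples_new) / k_new

    # Assumed denominator: unpooled binomial variance of the difference.
    var_diff = p_old * (1 - p_old) / k_old + p_new * (1 - p_new) / k_new

    if var_diff == 0.0:
        # Both versions are at 0 or 1, so there is no sampling variance.
        if p_new == p_old:
            return "floor/ceiling"
        # Assumption: a complete flip (0 -> 1 or 1 -> 0) counts as change.
        return "improved" if p_new > p_old else "deteriorated"

    rci = (p_new - p_old) / math.sqrt(var_diff)
    if rci > z_crit:
        return "improved"
    if rci < -z_crit:
        return "deteriorated"
    return "unchanged"
```

With ten samples per version, `classify_item([1] * 9 + [0], [0] * 9 + [1])` falls in the deteriorated class, while a move from 0.5 to 0.8 stays below the cutoff under this denominator.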
If this is right
- Churn rate must be reported next to aggregate accuracy when comparing LLM versions.
- Greedy single-shot evaluation misses 42% of reliably changed items while incorrectly flagging 25% of stable items.
- Item changes are asymmetric by difficulty, with low-accuracy questions improving and high-accuracy questions deteriorating.
- Domain-level reversals are family-specific, such as Llama losing ground in physics while Qwen loses ground in law.
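To make the "net residual" idea concrete, the sketch below splits a mean accuracy change into gross gains and gross losses and reports a churn rate alongside it. The name `summarize_churn` and the definition of churn rate as the share of items labelled reliably changed are assumptions; the source does not give a formula.

```python
def summarize_churn(deltas, labels):
    """Decompose the mean per-item accuracy change into gains and losses.

    deltas: per-item change in estimated accuracy (new minus old).
    labels: per-item class, one of "improved", "deteriorated", "unchanged".
    """
    n = len(deltas)
    gross_gain = sum(d for d in deltas if d > 0) / n
    gross_loss = -sum(d for d in deltas if d < 0) / n
    changed = sum(lab in ("improved", "deteriorated") for lab in labels)
    return {
        "net_gain": gross_gain - gross_loss,  # equals the mean of deltas
        "gross_gain": gross_gain,
        "gross_loss": gross_loss,
        "churn_rate": changed / n,  # assumed definition
    }
```

A small net gain can coexist with much larger gross gains and losses, which is the pattern the takeaways above describe.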
Where Pith is reading between the lines
- Benchmarks that report only aggregate scores may be understating the amount of model behavior that is actually unstable across versions.
- Model developers could use item-level churn maps to diagnose which capabilities are being strengthened or weakened in each update.
- Repeating the analysis on other benchmarks with different difficulty distributions would test whether bidirectional churn is a general feature of current LLM scaling.
Load-bearing premise
The standard RCI thresholds and the chosen sampling regime of ten draws at temperature 0.7 correctly identify reliable item-level changes for these LLMs without needing benchmark-specific or domain-specific recalibration.
What would settle it
A follow-up study that uses substantially more samples per item, or applies domain-specific variance estimates, and finds no net cancellation between item improvements and deteriorations would falsify the claim that aggregate gains are residuals of opposing movements.
Original abstract
We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were floor/ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
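The greedy comparison in the abstract can be read as two disagreement rates against the RCI labels. The sketch below is one plausible way to compute them; the function name, the input layout, and the rule that greedy "flags" an item whenever its single-shot correctness flips are assumptions, not the paper's stated procedure.

```python
def greedy_disagreement(rci_labels, greedy_old, greedy_new):
    """Compare single-shot greedy flips against multi-sample RCI labels.

    rci_labels: "improved", "deteriorated", or "unchanged" per item.
    greedy_old, greedy_new: booleans, greedy correctness per item.
    Returns (miss_rate, false_flag_rate).
    """
    missed = changed = flagged = stable = 0
    for label, g_old, g_new in zip(rci_labels, greedy_old, greedy_new):
        flipped = g_old != g_new
        if label in ("improved", "deteriorated"):
            changed += 1
            # Simplification: any flip counts as detection, whatever its direction.
            missed += not flipped
        else:
            stable += 1
            flagged += flipped
    miss_rate = missed / changed if changed else float("nan")
    false_flag_rate = flagged / stable if stable else float("nan")
    return miss_rate, false_flag_rate
```

Both rates are relative to the RCI classification, so they inherit whatever error that classification carries.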
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts the Reliable Change Index (RCI) from Jacobson and Truax (1991) to perform item-level analysis of performance changes between LLM versions on 2,000 MMLU-Pro items, using K=10 samples per item at temperature 0.7. It examines two within-family upgrades (Llama 3 to 3.1 yielding +1.6 aggregate points; Qwen 2.5 to 3 yielding +2.8 points) and reports that most items exhibit no reliable change (79% and 72%), with over half being floor/ceiling effects. Among analysable items, changes are bidirectional (34% improved/28% deteriorated for Llama; 47%/39% for Qwen, with median |delta p| of 0.50 and 0.90), asymmetric by difficulty, and domain-specific (e.g., Llama loses physics while Qwen loses law). The central claim is that aggregate accuracy gains are the net residual of opposing item-level movements. The paper also finds that greedy single-shot evaluation misses 42% of reliably changed items and falsely flags 25% of unchanged items, recommending that churn rates be reported alongside aggregate accuracy.
Significance. If the RCI adaptation is statistically appropriate for binomial LLM outputs, the work offers a valuable shift from mean-only evaluation toward granular detection of churn, stability, and domain reversals. The quantitative results on public data (bidirectional splits, 42% miss rate vs. greedy) provide a concrete demonstration that small aggregate gains can mask substantial item-level flux. This could influence evaluation protocols in the field by encouraging stability metrics. The approach is internally consistent with low circularity (external RCI reference plus new experiments) and merits follow-up, but its immediate significance is tempered by the unvalidated transfer of clinical RCI thresholds to small-K stochastic sampling.
major comments (3)
- [Methods (RCI Adaptation)] Methods section on RCI adaptation: The paper applies the standard Jacobson-Truax RCI formula and clinical cutoffs directly to per-item proportions estimated from K=10 samples at T=0.7, without reported recalibration, binomial variance adjustment, or Monte Carlo validation of false-positive rates. With binomial SE ≈ 0.15–0.16 near p=0.5, the difference-score reliability is low; the same cutoff may therefore classify sampling noise as 'reliable change.' This is load-bearing for the central claim, as the reported 34%/28% and 47%/39% bidirectional splits, the 79%/72% no-change rates, and the 'net residual' interpretation all depend on correct partitioning of the 2,000 items.
- [Results (Analysable Items)] Results (Analysable Items and Churn Analysis): The definition of 'analysable' items (excluding floor/ceiling) and the exact computation of the RCI denominator (S_diff) are not fully specified for discrete binomial data. Standard RCI assumes continuous scores and a reliability coefficient r; applying it without empirical r estimation or adjustment for small K risks misclassifying items, which directly affects the difficulty-asymmetric churn findings and the domain-level reversals (physics vs. law).
- [Experiments (Greedy Comparison)] Experiments (Greedy Comparison): The claim that greedy evaluation 'missed 42% of reliably changed items and falsely flagged 25% of unchanged items' treats the RCI classifications as ground truth. Without an independent validation (e.g., larger K, human ratings, or noise-injection simulations), this comparison inherits the same untested threshold assumption and cannot yet be used to critique standard practice.
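As a worked check of the binomial standard error quoted in the first major comment, the arithmetic below assumes two independent K=10 samples and an unpooled difference denominator with no reliability correction. That denominator is an assumption; the report itself notes the paper does not fully specify S_diff.

```latex
\begin{aligned}
\mathrm{SE}(\hat p) &= \sqrt{\frac{\hat p\,(1-\hat p)}{K}},
  &\mathrm{SE}(0.5)\big|_{K=10} &= \sqrt{0.025} \approx 0.158,\\
S_{\mathrm{diff}} &= \sqrt{\mathrm{SE}_{\mathrm{old}}^{2} + \mathrm{SE}_{\mathrm{new}}^{2}}
  \;\le\; \sqrt{2}\times 0.158 \approx 0.224,
  &1.96 \times 0.224 &\approx 0.438 .
\end{aligned}
```

Under this assumed denominator, any shift of 0.5 or more (five of ten samples) clears the cutoff regardless of baseline, and smaller shifts can clear it only when the estimated standard errors fall below their maximum at p = 0.5.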
minor comments (2)
- [Abstract] Abstract: The median |delta p| values (0.50 and 0.90) are reported without specifying which model pair each corresponds to; add this clarification and ensure consistency with the main-text tables.
- [Methods] Notation: The paper uses 'delta p' for item-level accuracy change; define this explicitly early in the Methods and confirm it is the difference in sample proportions (not logit or other transform).
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our adaptation of the Reliable Change Index to LLM evaluation. We address each major comment below and have revised the manuscript to improve clarity, add empirical validation, and appropriately qualify our claims.
Point-by-point responses
- Referee: [Methods (RCI Adaptation)] The paper applies the standard Jacobson-Truax RCI formula and clinical cutoffs directly to per-item proportions estimated from K=10 samples at T=0.7, without reported recalibration, binomial variance adjustment, or Monte Carlo validation of false-positive rates. With binomial SE ≈ 0.15–0.16 near p=0.5, the difference-score reliability is low; the same cutoff may therefore classify sampling noise as 'reliable change.'
Authors: We appreciate the referee's concern about transferring clinical RCI thresholds to stochastic binomial outputs. The standard RCI formula remains applicable to estimated proportions, with the standard error derived from the binomial variance, SE = sqrt(p(1-p)/K). To directly address potential over-classification of noise, the revised manuscript will include a Monte Carlo simulation: for each item we draw repeated null pairs of K=10 samples from the observed p under no true change, apply the RCI threshold, and report the resulting false-positive rate. This provides an empirical check on the 1.96 cutoff for our specific sampling regime and tests whether the reported bidirectional splits hold up. revision: yes
- Referee: [Results (Analysable Items)] The definition of 'analysable' items (excluding floor/ceiling) and the exact computation of the RCI denominator (S_diff) are not fully specified for discrete binomial data. Standard RCI assumes continuous scores and a reliability coefficient r; applying it without empirical r estimation or adjustment for small K risks misclassifying items.
Authors: We agree the manuscript should be more explicit. In the revision we will define analysable items as those with estimated accuracy strictly between 0 and 1 in both model versions (excluding pure floor/ceiling cases where change is undefined). S_diff is computed as sqrt(2 * SE^2 * (1-r)), where SE is the average binomial standard error across the two models and r is the observed correlation between the two independent K=10 sample vectors per item (or a conservative default of 0.5 when correlation is low). We will also report sensitivity analyses across r in [0.3, 0.8] to demonstrate robustness of the difficulty-asymmetric and domain-specific findings. revision: yes
- Referee: [Experiments (Greedy Comparison)] The claim that greedy evaluation 'missed 42% of reliably changed items and falsely flagged 25% of unchanged items' treats the RCI classifications as ground truth. Without an independent validation (e.g., larger K, human ratings, or noise-injection simulations), this comparison inherits the same untested threshold assumption and cannot yet be used to critique standard practice.
Authors: We acknowledge that the greedy comparison is relative to the RCI reference and does not constitute absolute ground truth. The intent is to quantify how single greedy draws diverge from multi-sample change detection. In the revision we will reframe the claim as a relative discrepancy under the RCI model and add a controlled noise-injection simulation: we inject binomial noise into stable items and measure how often greedy evaluation produces spurious changes. We do not possess larger-K or human-rated data in the current study, so we will not claim external validation beyond this simulation. revision: partial
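The Monte Carlo null check described in the first response could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `null_false_positive_rate`, the unpooled binomial denominator, and the treatment of zero-variance pairs are all hypothetical choices.

```python
import math
import random


def null_false_positive_rate(p_true, k=10, n_trials=10_000, z_crit=1.96, seed=0):
    """Estimate how often two samples from the SAME p are flagged as changed."""
    rng = random.Random(seed)

    def draw_proportion():
        # One batch of k Bernoulli(p_true) samples, returned as a proportion.
        return sum(rng.random() < p_true for _ in range(k)) / k

    flagged = 0
    for _ in range(n_trials):
        p_a, p_b = draw_proportion(), draw_proportion()
        # Assumed denominator: unpooled binomial variance of the difference.
        var_diff = (p_a * (1 - p_a) + p_b * (1 - p_b)) / k
        if var_diff == 0.0:
            # Both at 0 or 1: flag only a complete flip (assumption).
            flagged += p_a != p_b
            continue
        if abs(p_a - p_b) / math.sqrt(var_diff) > z_crit:
            flagged += 1
    return flagged / n_trials
```

Running this over a grid of `p_true` values would show where, if anywhere, the empirical false-positive rate departs from the nominal 5% at K=10.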
Circularity Check
No circularity; external RCI adapted to new LLM experiments
Full rationale
The paper adapts the Reliable Change Index directly from the independent 1991 Jacobson and Truax reference and applies it to fresh sampling (K=10 at T=0.7) on 2,000 MMLU-Pro items for two model pairs. The central claim that aggregate accuracy gains are the net residual of opposing item-level movements follows from counting items flagged as improved or deteriorated by the standard RCI formula; this count is not presupposed in the method definition, nor is any parameter fitted to the target aggregate result. No self-citations appear in the load-bearing steps, no ansatz is smuggled in, and no uniqueness theorem is invoked. The derivation chain is therefore self-contained against the external benchmark and the cited clinical formula.
Axiom & Free-Parameter Ledger
free parameters (2)
- Sampling temperature = 0.7
- Number of samples per item = 10
axioms (2)
- domain assumption: The Reliable Change Index from clinical psychology can be adapted to measure changes in LLM response accuracy on individual benchmark items
- domain assumption: Floor and ceiling items (over half the dataset) can be excluded without affecting the validity of conclusions about analysable items