Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

Jon-Paul Cacioli

arxiv: 2604.27405 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-07 07:58 UTC · model grok-4.3

keywords LLM evaluation · reliable change index · item-level analysis · benchmark churn · MMLU-Pro · model version comparison · aggregate accuracy · sampling variability

The pith

LLM accuracy gains on benchmarks are the net result of opposing improvements and deteriorations at the individual question level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper adapts the Reliable Change Index from clinical psychology to test whether an LLM's accuracy on individual benchmark items changes reliably between versions. On 2,000 MMLU-Pro questions sampled ten times each, most items show no reliable change, and over half sit at floor or ceiling. Among the analysable items, change runs in both directions, with gains only modestly outnumbering losses, so the two largely cancel in the headline score. The method also uncovers consistent patterns: low-accuracy items tend to improve while high-accuracy items deteriorate, and different model families lose ground in different domains. Single-shot greedy decoding misses 42% of the reliably changed items. The core insight is that reported accuracy improvements are residuals after many item-level reversals have occurred.

Core claim

The aggregate accuracy gain between LLM versions is the net residual of opposing item-level movements. Using the Reliable Change Index on MMLU-Pro, 79% of all items for Llama 3 to 3.1 and 72% for Qwen 2.5 to 3 showed no reliable change. Among analysable items (those not at floor or ceiling), 34% improved and 28% deteriorated for Llama while 47% improved and 39% deteriorated for Qwen, with median absolute probability shifts of 0.50 and 0.90 respectively.

What carries the argument

The Reliable Change Index (RCI) applied to per-item response probabilities across K=10 samples at temperature 0.7, which classifies each question as reliably improved, reliably deteriorated, or unchanged while accounting for sampling variability.
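
As a rough illustration of the item-level classification described above, the following Python sketch computes an RCI-style statistic from the number of correct answers out of K samples for each model version. It is a sketch under stated assumptions, not the paper's implementation: the function name `rci_classify`, the unpooled binomial standard error of the difference, the 1.96 cutoff, and the `no_variance` label for items with zero sample variance in both versions are all illustrative choices, since the exact denominator used in the paper is not given here.

```python
import math

def rci_classify(k_old, k_new, K=10, z_crit=1.96):
    """Classify one item from correct counts (out of K) for the old and new model.

    Assumption: the denominator is the unpooled binomial standard error of the
    difference between two independent proportions. The paper may use another form.
    """
    p_old, p_new = k_old / K, k_new / K
    var_old = p_old * (1 - p_old) / K
    var_new = p_new * (1 - p_new) / K
    s_diff = math.sqrt(var_old + var_new)
    if s_diff == 0:
        # Both versions answered identically across all K samples (0/K or K/K),
        # so there is no sampling variance to scale the difference by.
        return "no_variance", None
    rci = (p_new - p_old) / s_diff
    if rci > z_crit:
        return "improved", rci
    if rci < -z_crit:
        return "deteriorated", rci
    return "unchanged", rci

label, rci = rci_classify(k_old=2, k_new=9)
print(label, round(rci, 2))
```

How zero-variance items should be treated (excluded as floor/ceiling, or counted as deterministic flips when the two versions disagree) is a design decision this sketch deliberately leaves open.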

If this is right

Churn rate must be reported next to aggregate accuracy when comparing LLM versions.
Greedy single-shot evaluation misses 42% of reliably changed items while incorrectly flagging 25% of stable items.
Item changes are asymmetric by difficulty, with low-accuracy questions improving and high-accuracy questions deteriorating.
Domain-level reversals are family-specific, such as Llama losing ground in physics while Qwen loses ground in law.
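
To make the "net residual" reading concrete, here is a small Python sketch that reports gross gains, gross losses, and a churn rate next to the net accuracy change. The function name `churn_summary`, the label strings, and the definition of churn rate as the share of analysable items classified as reliably changed are assumptions made for illustration; the paper's own definition is not reproduced in this review.

```python
def churn_summary(p_old, p_new, labels):
    """Summarise item-level movement between two model versions.

    p_old, p_new: per-item accuracy proportions for each version.
    labels: per-item class, one of "improved", "deteriorated",
            "unchanged", or "no_variance" (hypothetical label set).
    """
    n = len(labels)
    deltas = [new - old for old, new in zip(p_old, p_new)]
    gross_gain = sum(d for d in deltas if d > 0) / n
    gross_loss = sum(d for d in deltas if d < 0) / n
    analysable = [lab for lab in labels if lab != "no_variance"]
    changed = [lab for lab in analysable if lab in ("improved", "deteriorated")]
    churn_rate = len(changed) / len(analysable) if analysable else None
    return {
        "net_gain": gross_gain + gross_loss,  # the headline accuracy change
        "gross_gain": gross_gain,
        "gross_loss": gross_loss,
        "churn_rate": churn_rate,
    }

# Toy data, not taken from the paper.
summary = churn_summary(
    p_old=[0.2, 0.9, 0.5, 1.0],
    p_new=[0.9, 0.3, 0.6, 1.0],
    labels=["improved", "deteriorated", "unchanged", "no_variance"],
)
print(summary)
```

In the toy data the net gain is small even though two of the three analysable items move substantially, which is the pattern the list above describes.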

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks that report only aggregate scores may be understating the amount of model behavior that is actually unstable across versions.
Model developers could use item-level churn maps to diagnose which capabilities are being strengthened or weakened in each update.
Repeating the analysis on other benchmarks with different difficulty distributions would test whether bidirectional churn is a general feature of current LLM scaling.

Load-bearing premise

The standard RCI thresholds and the chosen sampling regime of ten draws at temperature 0.7 correctly identify reliable item-level changes for these LLMs without needing benchmark-specific or domain-specific recalibration.

What would settle it

A follow-up study that uses a substantially larger number of samples per item or applies domain-specific variance estimates and still finds no net cancellation between item improvements and deteriorations would falsify the claim that aggregate gains are residuals of opposing movements.

Figures

Figures reproduced from arXiv: 2604.27405 by Jon-Paul Cacioli.

**Figure 1.** RCI value distribution for both model pairs.

**Figure 3.** Domain × RCI category heatmap (post-exclusion). Llama loses physics; Qwen loses law. The domains that deteriorate differ across families.

Adjacent text from Section 3.8 (Cross-pair analyses): H3 (cross-pair divergence) was supported: Qwen showed a higher reliable-change rate (85.9%) than Llama (62.1%), z = −10.4, p < .001. This is consistent with the generational versus minor update distinction. The cross-pair item-level RCI correlati…

**Figure 4.** Per-item accuracy in v1 versus v2, coloured by RCI classification. Left: Llama 3 → 3.1 (n = 952). Right: Qwen 2.5 → 3 (n = 652). Dashed line: no change. Dotted lines: RCI detection band.

Adjacent text: … analysable subset where stochastic variation occurred, bidirectional churn with large effect sizes was the dominant pattern. The distinction matters. The churn finding is conditional on the analysable subset. It does not …

Original abstract

We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were floor/ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Aggregate LLM gains mask bidirectional item churn that an adapted RCI can detect, though the thresholds need checking for sampling noise.

The key takeaway is that reported accuracy improvements when moving from one LLM version to the next are the net result of items that get better and items that get worse. The paper adapts the Reliable Change Index (Jacobson and Truax, 1991) to flag reliable per-item performance shifts. They run this on 2,000 MMLU-Pro items with K=10 samples at temperature 0.7 for two within-family upgrades. The results show most items have no reliable change, but among the analysable ones changes are bidirectional, with asymmetry by difficulty and some domain reversals. They also note that single greedy evaluations miss a large fraction of these shifts.

This is new in the LLM evaluation space because most papers just report mean accuracy or a few other aggregates. Here they surface the hidden opposing movements and give specific rates like 34% improved versus 28% deteriorated for the Llama pair. Using an established method from another field is a reasonable starting point, and the public benchmark makes the claims checkable.

The soft spot is the lack of validation for the RCI thresholds in this setting. Clinical RCI assumes continuous scores with known reliability, but here we have binomial proportions from small samples. The standard error around p=0.5 is roughly 0.16, which means the difference scores have low reliability and the usual cutoffs might misclassify sampling variation as real change. The paper does not appear to recalibrate or run robustness checks on this.

This work is aimed at researchers who care about evaluation practices beyond headline numbers. A reader interested in benchmarking methodology would get value from the concrete patterns and the call to report churn. It deserves peer review so the method can be examined more closely, particularly the adaptation details.
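
The standard-error figure quoted in the letter can be checked with a short calculation. The worked arithmetic below assumes a true accuracy of p = 0.5 in both versions, K = 10 independent samples per version, and a 1.96 cutoff on the difference of two independent proportions; these are simplifying assumptions for illustration, not the paper's stated procedure.

```latex
\begin{align*}
\mathrm{SE}(\hat p) &= \sqrt{\frac{p(1-p)}{K}} = \sqrt{\frac{0.5 \times 0.5}{10}} \approx 0.158,\\
\mathrm{SE}(\hat p_2 - \hat p_1) &= \sqrt{2}\,\mathrm{SE}(\hat p) \approx 0.224,\\
1.96 \times 0.224 &\approx 0.44.
\end{align*}
```

Under these assumptions, an item near p = 0.5 would need an observed shift of about 0.44 or more, which at K = 10 means at least five of ten samples, before it clears the cutoff.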

Referee Report

3 major / 2 minor

Summary. The paper adapts the Reliable Change Index (RCI) from Jacobson and Truax (1991) to perform item-level analysis of performance changes between LLM versions on 2,000 MMLU-Pro items, using K=10 samples per item at temperature 0.7. It examines two within-family upgrades (Llama 3 to 3.1 yielding +1.6 aggregate points; Qwen 2.5 to 3 yielding +2.8 points) and reports that most items exhibit no reliable change (79% and 72%), with over half being floor/ceiling effects. Among analysable items, changes are bidirectional (34% improved/28% deteriorated for Llama; 47%/39% for Qwen, with median |delta p| of 0.50 and 0.90), asymmetric by difficulty, and domain-specific (e.g., Llama loses physics while Qwen loses law). The central claim is that aggregate accuracy gains are the net residual of opposing item-level movements. The paper also finds that greedy single-shot evaluation misses 42% of reliably changed items and falsely flags 25% of unchanged items, recommending that churn rates be reported alongside aggregate accuracy.

Significance. If the RCI adaptation is statistically appropriate for binomial LLM outputs, the work offers a valuable shift from mean-only evaluation toward granular detection of churn, stability, and domain reversals. The quantitative results on public data (bidirectional splits, 42% miss rate vs. greedy) provide a concrete demonstration that small aggregate gains can mask substantial item-level flux. This could influence evaluation protocols in the field by encouraging stability metrics. The approach is internally consistent with low circularity (external RCI reference plus new experiments) and merits follow-up, but its immediate significance is tempered by the unvalidated transfer of clinical RCI thresholds to small-K stochastic sampling.

major comments (3)

[Methods (RCI Adaptation)] Methods section on RCI adaptation: The paper applies the standard Jacobson-Truax RCI formula and clinical cutoffs directly to per-item proportions estimated from K=10 samples at T=0.7, without reported recalibration, binomial variance adjustment, or Monte Carlo validation of false-positive rates. With binomial SE ≈ 0.15–0.16 near p=0.5, the difference-score reliability is low; the same cutoff may therefore classify sampling noise as 'reliable change.' This is load-bearing for the central claim, as the reported 34%/28% and 47%/39% bidirectional splits, the 79%/72% no-change rates, and the 'net residual' interpretation all depend on correct partitioning of the 2,000 items.
[Results (Analysable Items)] Results (Analysable Items and Churn Analysis): The definition of 'analysable' items (excluding floor/ceiling) and the exact computation of the RCI denominator (S_diff) are not fully specified for discrete binomial data. Standard RCI assumes continuous scores and a reliability coefficient r; applying it without empirical r estimation or adjustment for small K risks misclassifying items, which directly affects the difficulty-asymmetric churn findings and the domain-level reversals (physics vs. law).
[Experiments (Greedy Comparison)] Experiments (Greedy Comparison): The claim that greedy evaluation 'missed 42% of reliably changed items and falsely flagged 25% of unchanged items' treats the RCI classifications as ground truth. Without an independent validation (e.g., larger K, human ratings, or noise-injection simulations), this comparison inherits the same untested threshold assumption and cannot yet be used to critique standard practice.
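
For reference, the standard Jacobson and Truax (1991) formulation that the major comments refer to can be written as follows. The symbols are chosen here for illustration: s_x is the standard deviation of scores, r_xx is the test's reliability coefficient, and x_1, x_2 are the pre and post scores. How each term maps onto per-item binomial proportions is exactly what the comments ask the authors to specify, so no such mapping is assumed in this block.

```latex
\begin{align*}
S_E &= s_x \sqrt{1 - r_{xx}},\\
S_{\mathrm{diff}} &= \sqrt{2 S_E^2},\\
\mathrm{RCI} &= \frac{x_2 - x_1}{S_{\mathrm{diff}}},
\qquad |\mathrm{RCI}| > 1.96 \ \text{is read as reliable change at the 5\% level.}
\end{align*}
```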

minor comments (2)

[Abstract] Abstract: The median |delta p| values (0.50 and 0.90) are reported without specifying which model pair each corresponds to; add this clarification and ensure consistency with the main-text tables.
[Methods] Notation: The paper uses 'delta p' for item-level accuracy change; define this explicitly early in the Methods and confirm it is the difference in sample proportions (not logit or other transform).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our adaptation of the Reliable Change Index to LLM evaluation. We address each major comment below and have revised the manuscript to improve clarity, add empirical validation, and appropriately qualify our claims.

Referee: [Methods (RCI Adaptation)] The paper applies the standard Jacobson-Truax RCI formula and clinical cutoffs directly to per-item proportions estimated from K=10 samples at T=0.7, without reported recalibration, binomial variance adjustment, or Monte Carlo validation of false-positive rates. With binomial SE ≈ 0.15–0.16 near p=0.5, the difference-score reliability is low; the same cutoff may therefore classify sampling noise as 'reliable change.'

Authors: We appreciate the referee's concern about transferring clinical RCI thresholds to stochastic binomial outputs. The standard RCI formula remains applicable to estimated proportions, with the standard error taken from the binomial formula, SE = sqrt(p(1-p)/K). To directly address potential over-classification of noise, the revised manuscript will include a Monte Carlo simulation: for each item we draw repeated null pairs of K=10 samples from the observed p under no true change, apply the RCI threshold, and report the resulting false-positive rate. This provides an empirical check on the 1.96 cutoff for our specific sampling regime and tests whether the reported bidirectional splits hold up. revision: yes
Referee: [Results (Analysable Items)] The definition of 'analysable' items (excluding floor/ceiling) and the exact computation of the RCI denominator (S_diff) are not fully specified for discrete binomial data. Standard RCI assumes continuous scores and a reliability coefficient r; applying it without empirical r estimation or adjustment for small K risks misclassifying items.

Authors: We agree the manuscript should be more explicit. In the revision we will define analysable items as those with estimated accuracy strictly between 0 and 1 in both model versions (excluding pure floor/ceiling cases where change is undefined). S_diff is computed as sqrt(2 * SE^2 * (1-r)), where SE is the average binomial standard error across the two models and r is the observed correlation between the two independent K=10 sample vectors per item (or a conservative default of 0.5 when correlation is low). We will also report sensitivity analyses across r in [0.3, 0.8] to demonstrate robustness of the difficulty-asymmetric and domain-specific findings. revision: yes
Referee: [Experiments (Greedy Comparison)] The claim that greedy evaluation 'missed 42% of reliably changed items and falsely flagged 25% of unchanged items' treats the RCI classifications as ground truth. Without an independent validation (e.g., larger K, human ratings, or noise-injection simulations), this comparison inherits the same untested threshold assumption and cannot yet be used to critique standard practice.

Authors: We acknowledge that the greedy comparison is relative to the RCI reference and does not constitute absolute ground truth. The intent is to quantify how single greedy draws diverge from multi-sample change detection. In the revision we will reframe the claim as a relative discrepancy under the RCI model and add a controlled noise-injection simulation: we inject binomial noise into stable items and measure how often greedy evaluation produces spurious changes. We do not possess larger-K or human-rated data in the current study, so we will not claim external validation beyond this simulation. revision: partial
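
The null simulation proposed in the first response can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' code: the helper names `flags_change` and `null_false_positive_rate` are hypothetical, the denominator is the unpooled binomial standard error of the difference, and pairs with zero variance in both draws are treated as not flagged.

```python
import math
import random

def flags_change(k1, k2, K, z_crit=1.96):
    """Return True if an RCI-style statistic on two counts exceeds the cutoff."""
    p1, p2 = k1 / K, k2 / K
    s_diff = math.sqrt(p1 * (1 - p1) / K + p2 * (1 - p2) / K)
    # Assumption: zero-variance pairs are never flagged.
    return s_diff > 0 and abs(p2 - p1) / s_diff > z_crit

def null_false_positive_rate(p, K=10, n_sims=10_000, seed=0):
    """Estimate how often two draws from the SAME accuracy p are flagged as changed."""
    rng = random.Random(seed)

    def draw():
        # One binomial(K, p) draw: number of correct answers in K samples.
        return sum(rng.random() < p for _ in range(K))

    hits = sum(flags_change(draw(), draw(), K) for _ in range(n_sims))
    return hits / n_sims

for p in (0.1, 0.5, 0.9):
    print(p, null_false_positive_rate(p))
```

Comparing the printed rates with the nominal 5% level would indicate whether the 1.96 cutoff is well calibrated at K = 10 for a given item accuracy.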

Circularity Check

0 steps flagged

No circularity; external RCI adapted to new LLM experiments

The paper adapts the Reliable Change Index directly from the independent 1991 Jacobson & Truax reference and applies it to fresh sampling (K=10 at T=0.7) on 2000 MMLU-Pro items for two model pairs. The central claim that aggregate accuracy gains are the net residual of opposing item-level movements follows from counting items flagged as improved or deteriorated by the standard RCI formula; this count is not presupposed in the method definition, nor is any parameter fitted to the target aggregate result. No self-citations appear in the load-bearing steps, no ansatz is smuggled, and no uniqueness theorem is invoked. The derivation chain is therefore self-contained against the external benchmark and the cited clinical formula.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper introduces no new free parameters beyond standard choices for sampling and relies on an external statistical method and public benchmark data.

free parameters (2)

Sampling temperature = 0.7: chosen value for generating multiple responses to estimate per-item variability
Number of samples per item = 10: selected to compute accuracy proportions and change statistics

axioms (2)

domain assumption: The Reliable Change Index from clinical psychology can be adapted to measure changes in LLM response accuracy on individual benchmark items. The paper directly applies the 1991 RCI formula to LLM outputs.
domain assumption: Floor and ceiling items (over half the dataset) can be excluded without affecting the validity of conclusions about analysable items. The abstract notes that over half the items were floor/ceiling.

pith-pipeline@v0.9.0 · 5523 in / 1687 out tokens · 63937 ms · 2026-05-07T07:58:01.642162+00:00 · methodology
