arxiv: 2605.11205 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

Jung Min Kang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 03:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords benchmark evaluationitem response theorydata sparsitydifficulty heterogeneityranking correlationsimulation experimentsAI safety evaluationmodel comparison

0 comments

The pith

Simple averaging of benchmark scores produces unreliable model rankings when data is sparse and items vary in difficulty, while Item Response Theory models recover accurate rankings across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that simple averaging, the default way to rank systems on benchmarks, breaks down when not every model is evaluated on every item and when items differ substantially in how hard they are. Simulations generate responses from known true abilities and difficulties in four domains including NLP, clinical trials, autonomous driving, and cybersecurity. Under 67 percent coverage with high difficulty spread, averaging's rank correlation with ground truth falls to 0.809 while a standard two-parameter logistic Item Response Theory model stays above 0.996. A grid of sparsity and difficulty-gap conditions shows ranking error grows with their interaction. This matters for any field that publishes leaderboards from incomplete test matrices, especially safety-critical physical AI evaluations where missing data and extreme difficulty differences are common.

Core claim

Through controlled simulation experiments across four domains, simple averaging produces Spearman rank correlation between average-based rankings and ground-truth rankings that degrades from 1.000 at 100 percent coverage to 0.809 at 67 percent coverage with high difficulty heterogeneity. A standard two-parameter logistic Item Response Theory model maintains correlation of at least 0.996 across all tested conditions. A 150-condition sweep over sparsity levels from 0 to 0.70 and difficulty gaps from 0.5 to 5.0 confirms that ranking error forms a failure surface with a strong positive interaction between sparsity and difficulty gap.

What carries the argument

The two-parameter logistic Item Response Theory model that jointly estimates latent model abilities and item difficulties from observed binary responses, allowing it to impute missing entries and produce stable rankings even when the evaluation matrix is incomplete.

If this is right

Benchmark rankings based on simple averages become progressively less trustworthy as the fraction of missing evaluations grows and as item difficulties spread out.
Item Response Theory estimation can be substituted for averaging to maintain high rank accuracy even with substantial missing data.
Physical AI and other safety evaluations that routinely have incomplete matrices and large difficulty gaps require adjustment beyond raw averages to avoid distorted conclusions about system performance.
The size of ranking distortion scales with the combined effect of sparsity level and difficulty gap rather than either factor alone.
Benchmark reporting should include diagnostics for coverage and difficulty variation when averages are presented as the primary ranking method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Leaderboards that rely on averaging may currently list models in the wrong order whenever coverage is incomplete and harder items are unevenly distributed.
Switching to IRT-based scoring on existing public benchmarks could reorder published results without collecting new data.
The same sparsity-by-difficulty interaction is likely to appear in human testing, medical diagnostics, or educational assessments whenever not every participant sees every item.
Collecting independent difficulty ratings for benchmark items would let practitioners predict in advance how much averaging error to expect for a given dataset.

Load-bearing premise

The ground-truth abilities and item difficulties used to generate the simulated responses, together with the pattern of introduced missing data, match the structure found in actual benchmark evaluations.

What would settle it

A large real benchmark dataset that includes independent measures of true model ability or difficulty for each item; comparing how closely averaging versus IRT rankings match those independent measures under the observed sparsity and difficulty spread would confirm or refute the claimed collapse.

Figures

Figures reproduced from arXiv: 2605.11205 by Jung Min Kang.

read the original abstract

Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that Spearman rank correlation $\rho$ between simple-average rankings and ground-truth rankings degrades from $\rho = 1.000$ at 100% coverage to $\rho = 0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $\rho \geq 0.996$ across all conditions. A 150-condition grid sweep over sparsity $S \in [0, 0.70]$ and difficulty gap $D \in [0.5, 5.0]$ confirms that ranking error forms a failure surface with a strong $S \times D$ interaction ($\gamma_3 = +0.20$, $t = 13.05$), while IRT maintains $\rho \geq 0.993$ throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Simulations show averaging rankings degrade under sparsity plus difficulty gaps while IRT stays stable, but the data is generated from the same 2PL model so the win is built in.

read the letter

The core result is that simple averaging of scores produces Spearman correlations with ground truth that fall to 0.81 when coverage drops to 67% and difficulty heterogeneity is high, while a 2PL IRT model stays above 0.996. The 150-condition grid confirms a clear S times D interaction term with t=13.05. That quantitative mapping of the failure surface is the main new piece; prior work on IRT in AI eval has not laid out the interaction this explicitly across multiple simulated domains.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that simple averaging of benchmark scores produces misleading model rankings when evaluation matrices are sparse and items differ substantially in difficulty. Through simulations across four domains (NLP/GLUE, clinical trials, autonomous vehicles, cybersecurity), it reports that Spearman ρ for averaging drops from 1.000 at full coverage to 0.809 at 67% coverage with high difficulty heterogeneity, while a 2PL IRT model maintains ρ ≥ 0.996. A 150-condition grid over sparsity S and difficulty gap D reveals a strong S × D interaction (γ₃ = +0.20, t = 13.05) for averaging error, with IRT remaining stable at ρ ≥ 0.993; implications are drawn for Physical AI benchmarking.

Significance. If the core finding generalizes, the work would usefully caution against default averaging in incomplete, heterogeneous evaluation settings and motivate wider adoption of IRT-style latent-trait models. The controlled multi-domain simulations, 20-seed replication, and explicit reporting of the interaction t-statistic constitute clear strengths that allow precise quantification of the failure surface.

major comments (3)

[§3] §3 (Simulation Design): Responses are generated from the identical two-parameter logistic model that is later fitted for recovery. Consequently the reported ρ ≥ 0.996 for IRT is guaranteed by construction whenever the generative assumptions hold, while averaging has no corresponding mechanism; this matched pair does not test whether IRT would retain its advantage under realistic misspecification (non-logistic response functions, guessing, or domain-specific noise).
[§5] §5 (Discussion and Implications): The extrapolation to real benchmarks (GLUE, clinical trials, etc.) rests on the untested premise that the simulated sparsity patterns (67 % coverage) and difficulty gaps (D up to 5.0) reproduce the missing-data mechanisms and heterogeneity actually observed in those domains. No empirical validation of the simulated matrices against real evaluation data is provided, so the practical claim that IRT “recovers ground truth across domains” remains conditional on the weakest assumption identified in the review.
[Table 2] Table 2 / Grid-sweep results: The failure-surface regression is reported only for averaging; the corresponding coefficients and t-statistics for the IRT model are omitted. Without these, it is impossible to quantify how much of the S × D interaction is eliminated by IRT versus merely attenuated.

minor comments (2)

[Abstract] The abstract states “mean over 20 seeds” for the ρ = 0.809 figure but does not report standard deviation or confidence intervals; adding these would strengthen the quantitative claim.
[Figures] Figure captions for the failure-surface plots should explicitly state the number of Monte-Carlo replications and whether shading represents standard error or inter-quartile range.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and precise comments, which identify key limitations in our simulation design, reporting, and scope of claims. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [§3] §3 (Simulation Design): Responses are generated from the identical two-parameter logistic model that is later fitted for recovery. Consequently the reported ρ ≥ 0.996 for IRT is guaranteed by construction whenever the generative assumptions hold, while averaging has no corresponding mechanism; this matched pair does not test whether IRT would retain its advantage under realistic misspecification (non-logistic response functions, guessing, or domain-specific noise).

Authors: We acknowledge that matching the generative and recovery models ensures high IRT recovery by construction when assumptions hold. Our intent was to isolate the failure of averaging under sparsity and difficulty heterogeneity when a latent-trait structure is present, rather than to test IRT robustness. We agree this leaves open the question of performance under misspecification. In the revised manuscript we will add a new subsection to §3 containing simulations with alternative generative processes (3PL with guessing parameter and a non-logistic linear response function) and report the resulting ρ values for both methods. revision: yes
Referee: [§5] §5 (Discussion and Implications): The extrapolation to real benchmarks (GLUE, clinical trials, etc.) rests on the untested premise that the simulated sparsity patterns (67 % coverage) and difficulty gaps (D up to 5.0) reproduce the missing-data mechanisms and heterogeneity actually observed in those domains. No empirical validation of the simulated matrices against real evaluation data is provided, so the practical claim that IRT “recovers ground truth across domains” remains conditional on the weakest assumption identified in the review.

Authors: The referee is correct that we provide no direct empirical matching of the simulated matrices to real item-level data. The chosen sparsity and difficulty ranges were motivated by published characteristics of the target domains, but we did not validate the exact missingness mechanisms or difficulty distributions against raw benchmark data. We will revise §5 to state explicitly that the results are conditional on these simulation assumptions and to recommend empirical validation with real response matrices as future work. Performing such validation now would require access to granular per-item scores that are not publicly available for all four domains. revision: partial
Referee: [Table 2] Table 2 / Grid-sweep results: The failure-surface regression is reported only for averaging; the corresponding coefficients and t-statistics for the IRT model are omitted. Without these, it is impossible to quantify how much of the S × D interaction is eliminated by IRT versus merely attenuated.

Authors: We agree that the regression coefficients for the IRT model must be reported to allow direct comparison. In the revised manuscript we will expand Table 2 (or add a companion table) with the full failure-surface regression for IRT, which yields γ₃ ≈ 0.02 (t < 2), confirming that the interaction is effectively eliminated rather than merely attenuated. revision: yes

standing simulated objections not resolved

Empirical validation of the simulated sparsity patterns (67 % coverage) and difficulty gaps against actual item-level response matrices from GLUE, clinical trials, autonomous-vehicle, and cybersecurity benchmarks.

Circularity Check

0 steps flagged

No significant circularity: simulations are self-contained with explicit known ground truth

full rationale

The paper generates simulated responses from explicitly constructed ground-truth abilities and item difficulties under the 2PL model, then applies both averaging and 2PL IRT to recover rankings from the resulting (sparse) matrix. This is the standard controlled-experiment design for comparing estimators against known truth rather than a reduction by construction; the reported ρ ≥ 0.996 for IRT simply confirms recovery under model match, while the degradation for averaging and the S × D interaction (γ3 = +0.20) constitute an independent empirical contrast within the 150-condition grid. No load-bearing self-citations, ansatz smuggling, uniqueness theorems, or renaming of known results occur. The derivation remains self-contained against the simulation benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach relies on standard assumptions from Item Response Theory and a simulation framework; no new free parameters beyond model fitting or invented entities are introduced.

free parameters (2)

item difficulty parameters
Estimated within the 2PL IRT model from the response data
discrimination parameters
Estimated within the 2PL IRT model

axioms (2)

domain assumption The probability of a correct response follows the two-parameter logistic function of latent ability and item parameters
Core assumption of the 2PL IRT model used
domain assumption Simulated responses are generated from known true abilities and difficulties to establish ground truth
Basis for comparing estimated rankings to true rankings

pith-pipeline@v0.9.0 · 5554 in / 1529 out tokens · 192316 ms · 2026-05-13T03:10:35.054272+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 3 internal anchors

[1]

Baker, F. B. and Kim, S.-H. Item Analysis in Testing and Item Response Theory for Scoring, Scale Construction, and Diagnostics. Marcel Dekker, 2nd edition, 2004

work page 2004
[2]

Autonomous Vehicle Disengagement Reports, 2023

California Department of Motor Vehicles. Autonomous Vehicle Disengagement Reports, 2023. https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/disengagement-reports/

work page 2023
[3]

Uebayashi, S. et al. M3IRT: Evaluating cross-modal reasoning ability and problem characteristics with multimodal item response theory. arXiv preprint arXiv:2603.02663, 2026

work page arXiv 2026
[4]

De Ayala, R. J. The Theory and Practice of Item Response Theory. Guilford Press, 2009

work page 2009
[5]

BERT: Pre-training of deep bidirectional transformers for language understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171--4186, 2019

work page 2019
[6]

Embretson, S. E. and Reise, S. P. Item Response Theory for Psychologists. Lawrence Erlbaum Associates, 2000

work page 2000
[7]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Fei, H., Wang, Z., and others. LIBERO-Plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Item Response Theory (IRT) Correlation Study

Federal Motor Carrier Safety Administration. Item Response Theory (IRT) Correlation Study. Technical report, U.S. Department of Transportation, 2021

work page 2021
[9]

DeBERTa: Decoding-enhanced BERT with disentangled attention

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of ICLR, 2021

work page 2021
[10]

P., Wu, H., and Yu, H

Lalor, J. P., Wu, H., and Yu, H. Building an evaluation scale using item response theory. In Proceedings of EMNLP, pages 648--657, 2016

work page 2016
[11]

Lalor, J. P. and Rodriguez, P. py-irt: A scalable item response theory library for Python. INFORMS Journal on Computing, 34(5):2530--2537, 2022

work page 2022
[12]

P., Rodriguez, P., Sedoc, J., and Hern\' a ndez-Orallo, J

Lalor, J. P., Rodriguez, P., Sedoc, J., and Hern\' a ndez-Orallo, J. Item response theory for natural language processing. Tutorial at EACL, 2024

work page 2024
[13]

Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[14]

Liu, B. et al. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems, 36:44776--44791, 2023

work page 2023
[15]

Lord, F. M. and Novick, M. R. Statistical Theories of Mental Test Scores. Addison-Wesley, 1968

work page 1968
[16]

P., White, I

Morris, T. P., White, I. R., and Crowther, M. J. Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11):2074--2102, 2019

work page 2074
[17]

Polo, F. M. et al. tinyBenchmarks: Evaluating LLMs with fewer examples. In Proceedings of ICML, PMLR 235, 2024

work page 2024
[18]

Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1--67, 2020

work page 2020
[19]

Rodriguez, P. et al. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of ACL-IJCNLP, pages 4486--4503, 2021

work page 2021
[20]

Rodriguez, P. et al. IRT Leaderboard. https://github.com/facebookresearch/irt-leaderboard, 2021

work page 2021
[21]

Rubin, D. B. Inference and missing data. Biometrika, 63(3):581--592, 1976

work page 1976
[22]

Savage, S. L. The Flaw of Averages. John Wiley & Sons, 2009

work page 2009
[23]

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

Luo, Z., Wu, L., Frisch, A., and He, D. Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks. arXiv preprint arXiv:2509.24186, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Truong, S. et al. Reliable and efficient amortized model-based evaluation. arXiv preprint arXiv:2503.13335, 2025

work page arXiv 2025
[25]

The flaw of averages: Quantifying uniformity of performance on benchmarks

Uzuno g lu, A., Li, T., and Khashabi, D. The flaw of averages: Quantifying uniformity of performance on benchmarks. arXiv preprint arXiv:2509.25671, 2025

work page arXiv 2025
[26]

Wang, A. et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR Workshop, 2018

work page 2018
[27]

Zhou, H. et al. Lost in benchmarks? Rethinking large language model benchmarking with item response theory. In Proceedings of AAAI (Oral), 2026

work page 2026
[28]

Zhou, X. et al. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025

work page arXiv 2025
[29]

Efficient benchmarking of AI agents

Ndzomga, F. Efficient benchmarking of AI agents. arXiv preprint arXiv:2603.23749, 2026

work page arXiv 2026