Auditing LLM Benchmarks with Item Response Theory
Pith reviewed 2026-06-29 07:30 UTC · model grok-4.3
The pith
Item Response Theory applied to 114 models identifies mislabeled items in LLM benchmarks at 95% precision among the top 200 examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fitting an IRT model to binary responses from 114 models, the authors obtain per-item difficulty and discrimination parameters that surface likely mislabels at 95% precision in the top 200 examples across seven benchmarks, outperforming a supervised classifier. The errors arise from mechanical labeling heuristics, upstream annotation mistakes inherited from source datasets, and fundamentally ambiguous items. The fitted model also shows reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with the detected mislabels at 78% accuracy versus 38% for its peers.
What carries the argument
The unidimensional Item Response Theory model that estimates difficulty and discrimination parameters for each benchmark item from the pattern of correct and incorrect responses across many models.
If this is right
- Benchmark errors commonly originate from simple labeling heuristics or mistakes copied from source datasets.
- Reward models capture stylistic preferences far more than factual knowledge.
- One frontier reward model aligns with detected mislabels at 78% accuracy, consistent with contamination or benchmark-specific over-optimization.
- The IRT approach provides a label-free way to audit benchmarks that outperforms training a supervised classifier on the same responses.
Where Pith is reading between the lines
- Periodic re-application of this auditing step could keep benchmark quality from degrading as new models are released.
- The observed specialization of reward models suggests current preference data may not drive gains in factual reasoning.
- The same response-pattern analysis could be tested on open-ended generation benchmarks to check for similar label problems.
Load-bearing premise
The IRT assumptions of a single latent ability dimension and locally independent responses hold well enough for LLM answer patterns that the resulting parameters reliably point to label errors rather than model idiosyncrasies.
What would settle it
Independent expert review of the top 200 items flagged by the IRT indicator showing precision below 80% would show the method does not reliably surface true mislabels.
Figures
read the original abstract
LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an Item Response Theory (IRT)-based indicator to detect likely mislabeled items in LLM benchmarks. Fitting IRT models to responses from 114 models across seven preference and multiple-choice benchmarks, it claims 95% precision on the top 200 flagged examples, outperforming a supervised classifier. It traces errors to labeling heuristics, inherited annotation mistakes, and ambiguous items, while also showing that reward models specialize in stylistic preferences and identifying one frontier model with 78% agreement on detected mislabels versus 38% for peers.
Significance. If the central numerical claims and IRT-based mapping to mislabels are substantiated, the work offers a scalable, label-free method for auditing benchmarks using existing model response matrices. This could improve data quality in LLM training and evaluation pipelines. The scale (114 models, 7 benchmarks) and dual use for both mislabel detection and reward-model analysis are strengths.
major comments (3)
- [Abstract] Abstract: the headline claim of 95% precision on the top-200 mislabels is presented without any derivation details, error bars, ablation on the IRT fitting procedure, or description of how precision is computed against external ground truth, rendering the central empirical result unverifiable from the given information.
- [Abstract] Abstract: no model-fit diagnostics (residual correlations, item-fit statistics, or dimensionality tests) are reported. Given that the mislabel flag is derived directly from the fitted difficulty and discrimination parameters on the same response matrix, violation of unidimensionality or local independence would mean the parameters partly capture model idiosyncrasies rather than item quality, directly undermining the precision claim.
- [Abstract] Abstract: the statement that the IRT indicator outperforms a supervised classifier lacks any description of the baseline (features, training regime, or cross-validation), so the comparative claim cannot be assessed.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly named the IRT model variant (e.g., 2PL) and the exact number of items per benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify opportunities to improve the verifiability of the abstract claims. We address each point below and will make targeted revisions to the abstract and main text to incorporate additional methodological details while preserving the original results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of 95% precision on the top-200 mislabels is presented without any derivation details, error bars, ablation on the IRT fitting procedure, or description of how precision is computed against external ground truth, rendering the central empirical result unverifiable from the given information.
Authors: The abstract is intentionally concise. Precision at 95% for the top 200 is computed via manual review by two annotators of whether each flagged item has an incorrect benchmark label, with inter-annotator agreement reported in Section 4. Full IRT fitting details (2PL model, marginal maximum likelihood estimation), ablations on model count, and bootstrap-derived error bars appear in Sections 3.2 and 4.1. We will revise the abstract to include a one-sentence summary of the validation procedure and explicit section references. revision: yes
-
Referee: [Abstract] Abstract: no model-fit diagnostics (residual correlations, item-fit statistics, or dimensionality tests) are reported. Given that the mislabel flag is derived directly from the fitted difficulty and discrimination parameters on the same response matrix, violation of unidimensionality or local independence would mean the parameters partly capture model idiosyncrasies rather than item quality, directly undermining the precision claim.
Authors: We agree that fit diagnostics are necessary to support the IRT assumptions. The current manuscript does not report them in the main text. In revision we will add a methods subsection presenting eigenvalue-ratio tests for unidimensionality, item-fit statistics, and residual correlation checks, along with a short discussion confirming that the diagnostics support use of the parameters for mislabel detection. revision: yes
-
Referee: [Abstract] Abstract: the statement that the IRT indicator outperforms a supervised classifier lacks any description of the baseline (features, training regime, or cross-validation), so the comparative claim cannot be assessed.
Authors: The supervised baseline (logistic regression on per-item response proportions across the 114 models, trained with 5-fold cross-validation on 500 manually labeled items) and its performance numbers are described in Section 5 and Table 5. We will revise the abstract to add a brief clause describing the baseline features and cross-validation setup so the outperformance claim is self-contained. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper fits standard IRT models to the response matrix of 114 models across benchmarks and derives a mislabel indicator from the resulting difficulty/discrimination parameters. The headline performance (95% precision on top-200 flagged items) is obtained via external manual inspection of those items rather than any internal prediction or self-referential metric. No steps match the enumerated circularity patterns: there are no self-definitional reductions, no fitted inputs renamed as predictions, no load-bearing self-citations, and no imported uniqueness theorems or ansatzes. The derivation remains self-contained against the external human validation step.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Item Response Theory assumptions (unidimensional latent trait, local independence) apply to LLM answer patterns.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2511.04689 , year=
Human feedback is not gold standard. In The Twelfth International Conference on Learn- ing Representations. Jan-Christoph Klie, Bonnie Webber, and Iryna Gurevych. 2023. Annotation error detection: An- alyzing the past and present for a more coherent future.Computational Linguistics, 49(1):157– 198. John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu...
-
[2]
Detecting Pretraining Data from Large Language Models
tinybenchmarks: evaluating llms with fewer examples. InProceedings of the 41st In- ternational Conference on Machine Learning, ICML’24. JMLR.org. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bow- man. 2024. GPQA: A graduate-level google- proof Q&A benchmark. InFirst Confe...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Do large language model benchmarks test reliability?arXiv preprint arXiv:2502.03461, 2025
JudgeBench: A benchmark for evaluating LLM-based judges. InInternational Conference on Learning Representations (ICLR). ClaraVania, PhuMonHtut, WilliamHuang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, and Samuel R. Bowman.2021. Comparingtestsetswithitemre- sponse theory. InProceedings of the 59th Annual Meeting of the Asso...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.