Auditing LLM Benchmarks with Item Response Theory

Daniel M. Bikel; Sander Land

arxiv: 2605.30504 · v1 · pith:ID2IWZPZnew · submitted 2026-05-28 · 💻 cs.CL

Auditing LLM Benchmarks with Item Response Theory

Sander Land , Daniel M. Bikel This is my paper

Pith reviewed 2026-06-29 07:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords item response theoryllm benchmarksmislabel detectionbenchmark auditingreward modelspreference evaluationmultiple choice datasets

0 comments

The pith

Item Response Theory applied to 114 models identifies mislabeled items in LLM benchmarks at 95% precision among the top 200 examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an Item Response Theory model fitted to answer patterns from 114 language models can flag likely errors in benchmark labels across seven preference and multiple-choice datasets. This indicator reaches 95% precision in its top 200 detections and beats a supervised classifier. Errors trace back to mechanical labeling rules, mistakes copied from earlier datasets, and questions that lack a single correct answer. The same model fit indicates that reward models mainly learn stylistic preferences rather than factual content, with one frontier model matching the flagged mislabels 78% of the time. Readers should care because these benchmarks are reused to train and score new models, so label errors spread into downstream systems.

Core claim

By fitting an IRT model to binary responses from 114 models, the authors obtain per-item difficulty and discrimination parameters that surface likely mislabels at 95% precision in the top 200 examples across seven benchmarks, outperforming a supervised classifier. The errors arise from mechanical labeling heuristics, upstream annotation mistakes inherited from source datasets, and fundamentally ambiguous items. The fitted model also shows reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with the detected mislabels at 78% accuracy versus 38% for its peers.

What carries the argument

The unidimensional Item Response Theory model that estimates difficulty and discrimination parameters for each benchmark item from the pattern of correct and incorrect responses across many models.

If this is right

Benchmark errors commonly originate from simple labeling heuristics or mistakes copied from source datasets.
Reward models capture stylistic preferences far more than factual knowledge.
One frontier reward model aligns with detected mislabels at 78% accuracy, consistent with contamination or benchmark-specific over-optimization.
The IRT approach provides a label-free way to audit benchmarks that outperforms training a supervised classifier on the same responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Periodic re-application of this auditing step could keep benchmark quality from degrading as new models are released.
The observed specialization of reward models suggests current preference data may not drive gains in factual reasoning.
The same response-pattern analysis could be tested on open-ended generation benchmarks to check for similar label problems.

Load-bearing premise

The IRT assumptions of a single latent ability dimension and locally independent responses hold well enough for LLM answer patterns that the resulting parameters reliably point to label errors rather than model idiosyncrasies.

What would settle it

Independent expert review of the top 200 items flagged by the IRT indicator showing precision below 80% would show the method does not reliably surface true mislabels.

Figures

Figures reproduced from arXiv: 2605.30504 by Daniel M. Bikel, Sander Land.

**Figure 1.** Figure 1: From one bad label to a systematic signal. Top: an RM-Bench Chat Easy item whose reference answer is plainly wrong. Bottom: Our indicator separates label errors with high precision: flagged items are mislabeled or subjective 81% of the time, compared with 3% of unflagged items; precision reaches 95% among the top 200. We address both with Item Response Theory (IRT; Hambleton et al., 1991), a psychometric… view at source ↗

**Figure 2.** Figure 2: Item difficulty bi vs. ceiling di from the 4PL fit, split by weak reference label from the GPT-5.4 aggregator. Items labeled mislabel form a distinct low-ceiling population near di ≈ ci , while items labeled label_correct concentrate near di ≈ 1 across all difficulties. −2 −1 0 1 2 model ability θm incorrect (0) 0.5 correct (1) o bserved resp o nse yim Hard, correctly labelled item Δℓi = −0.24 → keep −2 −1… view at source ↗

**Figure 3.** Figure 3: Forced-ceiling contrast on two real items. Dots show model responses versus ability; curves [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Three recurring ways benchmark labels fail. Representative high-∆ℓi audit items show labels caused by verifier artifacts, inherited source errors, and items without a defensible single key. of the unsupervised indicator (∆ℓi < −0.05, |∆ℓi | ≤ 0.05, ∆ℓi > 0.05). On preference benchmarks, most reward models stay within the non-reward distributions and show no systematic tendency to agree with bad references… view at source ↗

**Figure 5.** Figure 5: Per-benchmark ability deviation ∆θm,s = θm,s − ¯θm, after averaging subset-level deviations within benchmark families. Rows show all six reward models, followed by the eight generative models with the largest total specialization P s |∆θm,s| and the two with the smallest; the right bar reports this total. Cells are annotated when |∆θm,s| > 0.1. 0% 20% 40% 60% 80% 100% Accuracy Weak ref. correct Δℓ < −0.05 … view at source ↗

**Figure 6.** Figure 6: Reward-model behavior on weak reference labels and [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Constraint-box sensitivity. Lollipops show percentage-point change from the default forced [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Preliminary-filter sensitivity. Left: strict P@200 and mislabel+subjective precision as [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Held-clean GPQA items appearing in the top [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

read the original abstract

LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IRT on 114-model response matrices flags benchmark mislabels at high precision but the unidimensional assumption needs explicit checks.

read the letter

The paper fits Item Response Theory to a response matrix from 114 models on seven benchmarks and uses the resulting difficulty and discrimination parameters to surface likely mislabels. It reports 95% precision on the top 200 flagged items, beats a supervised baseline, traces errors to heuristics and inherited annotation mistakes, and adds that most reward models align with stylistic cues rather than facts while one outlier matches the flagged mislabels at 78%.

The scale is the real contribution. Collecting answers from that many models gives a dense enough matrix to estimate stable item parameters without extra human labels, and the downstream reward-model observation follows directly from the same fit. That part is straightforward and useful for anyone thinking about what preference data actually rewards.

The soft spot is the IRT assumptions. LLM families differ in architecture and training, so response patterns are likely to show residual correlations or multiple latent dimensions. If local independence or unidimensionality fails, the parameters will partly reflect model idiosyncrasies instead of item quality, which would inflate the apparent precision. The abstract gives no dimensionality tests, item-fit statistics, or residual checks, so it is hard to judge how much the 95% number depends on those assumptions holding.

This is for groups that maintain benchmarks or study reward-model behavior. A reader working on evaluation infrastructure would get concrete examples and a workable detection method. It deserves peer review because the approach is applied at scale and the findings are falsifiable with the right diagnostics, even if the current write-up leaves the fit quality open.

Referee Report

3 major / 1 minor

Summary. The paper introduces an Item Response Theory (IRT)-based indicator to detect likely mislabeled items in LLM benchmarks. Fitting IRT models to responses from 114 models across seven preference and multiple-choice benchmarks, it claims 95% precision on the top 200 flagged examples, outperforming a supervised classifier. It traces errors to labeling heuristics, inherited annotation mistakes, and ambiguous items, while also showing that reward models specialize in stylistic preferences and identifying one frontier model with 78% agreement on detected mislabels versus 38% for peers.

Significance. If the central numerical claims and IRT-based mapping to mislabels are substantiated, the work offers a scalable, label-free method for auditing benchmarks using existing model response matrices. This could improve data quality in LLM training and evaluation pipelines. The scale (114 models, 7 benchmarks) and dual use for both mislabel detection and reward-model analysis are strengths.

major comments (3)

[Abstract] Abstract: the headline claim of 95% precision on the top-200 mislabels is presented without any derivation details, error bars, ablation on the IRT fitting procedure, or description of how precision is computed against external ground truth, rendering the central empirical result unverifiable from the given information.
[Abstract] Abstract: no model-fit diagnostics (residual correlations, item-fit statistics, or dimensionality tests) are reported. Given that the mislabel flag is derived directly from the fitted difficulty and discrimination parameters on the same response matrix, violation of unidimensionality or local independence would mean the parameters partly capture model idiosyncrasies rather than item quality, directly undermining the precision claim.
[Abstract] Abstract: the statement that the IRT indicator outperforms a supervised classifier lacks any description of the baseline (features, training regime, or cross-validation), so the comparative claim cannot be assessed.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly named the IRT model variant (e.g., 2PL) and the exact number of items per benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to improve the verifiability of the abstract claims. We address each point below and will make targeted revisions to the abstract and main text to incorporate additional methodological details while preserving the original results.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of 95% precision on the top-200 mislabels is presented without any derivation details, error bars, ablation on the IRT fitting procedure, or description of how precision is computed against external ground truth, rendering the central empirical result unverifiable from the given information.

Authors: The abstract is intentionally concise. Precision at 95% for the top 200 is computed via manual review by two annotators of whether each flagged item has an incorrect benchmark label, with inter-annotator agreement reported in Section 4. Full IRT fitting details (2PL model, marginal maximum likelihood estimation), ablations on model count, and bootstrap-derived error bars appear in Sections 3.2 and 4.1. We will revise the abstract to include a one-sentence summary of the validation procedure and explicit section references. revision: yes
Referee: [Abstract] Abstract: no model-fit diagnostics (residual correlations, item-fit statistics, or dimensionality tests) are reported. Given that the mislabel flag is derived directly from the fitted difficulty and discrimination parameters on the same response matrix, violation of unidimensionality or local independence would mean the parameters partly capture model idiosyncrasies rather than item quality, directly undermining the precision claim.

Authors: We agree that fit diagnostics are necessary to support the IRT assumptions. The current manuscript does not report them in the main text. In revision we will add a methods subsection presenting eigenvalue-ratio tests for unidimensionality, item-fit statistics, and residual correlation checks, along with a short discussion confirming that the diagnostics support use of the parameters for mislabel detection. revision: yes
Referee: [Abstract] Abstract: the statement that the IRT indicator outperforms a supervised classifier lacks any description of the baseline (features, training regime, or cross-validation), so the comparative claim cannot be assessed.

Authors: The supervised baseline (logistic regression on per-item response proportions across the 114 models, trained with 5-fold cross-validation on 500 manually labeled items) and its performance numbers are described in Section 5 and Table 5. We will revise the abstract to add a brief clause describing the baseline features and cross-validation setup so the outperformance claim is self-contained. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper fits standard IRT models to the response matrix of 114 models across benchmarks and derives a mislabel indicator from the resulting difficulty/discrimination parameters. The headline performance (95% precision on top-200 flagged items) is obtained via external manual inspection of those items rather than any internal prediction or self-referential metric. No steps match the enumerated circularity patterns: there are no self-definitional reductions, no fitted inputs renamed as predictions, no load-bearing self-citations, and no imported uniqueness theorems or ansatzes. The derivation remains self-contained against the external human validation step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the IRT model for LLM responses and on the assumption that the 114-model sample is representative; no free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption Item Response Theory assumptions (unidimensional latent trait, local independence) apply to LLM answer patterns.
Required for the fitted parameters to be interpretable as difficulty and discrimination rather than artifacts.

pith-pipeline@v0.9.1-grok · 5639 in / 1243 out tokens · 22535 ms · 2026-06-29T07:30:04.133449+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

arXiv preprint arXiv:2511.04689 , year=

Human feedback is not gold standard. In The Twelfth International Conference on Learn- ing Representations. Jan-Christoph Klie, Bonnie Webber, and Iryna Gurevych. 2023. Annotation error detection: An- alyzing the past and present for a more coherent future.Computational Linguistics, 49(1):157– 198. John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu...

work page arXiv 2023
[2]

Detecting Pretraining Data from Large Language Models

tinybenchmarks: evaluating llms with fewer examples. InProceedings of the 41st In- ternational Conference on Machine Learning, ICML’24. JMLR.org. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bow- man. 2024. GPQA: A graduate-level google- proof Q&A benchmark. InFirst Confe...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Do large language model benchmarks test reliability?arXiv preprint arXiv:2502.03461, 2025

JudgeBench: A benchmark for evaluating LLM-based judges. InInternational Conference on Learning Representations (ICLR). ClaraVania, PhuMonHtut, WilliamHuang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, and Samuel R. Bowman.2021. Comparingtestsetswithitemre- sponse theory. InProceedings of the 59th Annual Meeting of the Asso...

work page arXiv 2021

[1] [1]

arXiv preprint arXiv:2511.04689 , year=

Human feedback is not gold standard. In The Twelfth International Conference on Learn- ing Representations. Jan-Christoph Klie, Bonnie Webber, and Iryna Gurevych. 2023. Annotation error detection: An- alyzing the past and present for a more coherent future.Computational Linguistics, 49(1):157– 198. John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu...

work page arXiv 2023

[2] [2]

Detecting Pretraining Data from Large Language Models

tinybenchmarks: evaluating llms with fewer examples. InProceedings of the 41st In- ternational Conference on Machine Learning, ICML’24. JMLR.org. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bow- man. 2024. GPQA: A graduate-level google- proof Q&A benchmark. InFirst Confe...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Do large language model benchmarks test reliability?arXiv preprint arXiv:2502.03461, 2025

JudgeBench: A benchmark for evaluating LLM-based judges. InInternational Conference on Learning Representations (ICLR). ClaraVania, PhuMonHtut, WilliamHuang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, and Samuel R. Bowman.2021. Comparingtestsetswithitemre- sponse theory. InProceedings of the 59th Annual Meeting of the Asso...

work page arXiv 2021