AI Evaluation Should Require Standardized Item-Level Data Releases

Dongyao Zhu; Han Jiang; Sang T. Truong; Sanmi Koyejo; Susu Zhang; Xiaoyuan Yi; Xing Xie; Yuzhuo Bai; Ziang Xiao

arxiv: 2604.03244 · v2 · pith:H32PD7AZnew · submitted 2026-02-27 · 💻 cs.AI · cs.CY · cs.DB

Pith reviewed 2026-05-15 19:13 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.DB

keywords AI evaluation · benchmarks · item-level data · validity evidence · psychometrics · generative AI · benchmark diagnostics

The pith

Item-level AI benchmark data is essential for a rigorous science of AI evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current AI evaluations suffer from validity failures such as unjustified design choices and misaligned metrics because they operate only on aggregate scores. Item-level data would permit fine-grained analysis of individual test items and their underlying constructs, enabling the kind of diagnostic evidence collection that psychometrics already uses. Without this shift, the authors argue, high-stakes deployment of generative AI systems rests on untrustworthy foundations. They illustrate the point by re-examining evaluation practices across computer science and psychometrics and show how item properties become visible only at this resolution.

Core claim

Item-level benchmark data enables the fine-grained diagnostics and principled validation needed to address the unjustified design choices and misaligned metrics that plague current AI evaluations. By examining individual item properties and latent constructs, this data provides insights not available from summary statistics alone. The authors support their position through analysis of existing paradigms and propose a repository of such data to catalyze adoption.

What carries the argument

Item-level benchmark data that records performance on each individual test item rather than aggregate scores, allowing examination of item properties and latent constructs.
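
A minimal sketch of what such records make computable, assuming binary correctness scores and the classical test theory (CTT) statistics named in the Figure 3 excerpt (difficulty and discrimination). The response matrix, the function names, and the choice of a corrected item-total correlation as the discrimination index are illustrative assumptions, not the paper's or OpenEval's implementation.

```python
from statistics import mean, pstdev

# Hypothetical item-level response matrix: rows are models, columns are items,
# entries are 1 (correct) or 0 (incorrect). Values are invented for illustration.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

def pearson(xs, ys):
    """Pearson correlation; returns None when either variable has zero variance."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

def item_statistics(matrix):
    """Return a (difficulty, discrimination) pair for every item."""
    stats = []
    for j in range(len(matrix[0])):
        item = [row[j] for row in matrix]
        # Rest score: total score excluding item j, so the item is not
        # correlated with itself.
        rest = [sum(row) - row[j] for row in matrix]
        difficulty = mean(item)               # proportion correct (higher = easier)
        discrimination = pearson(item, rest)  # corrected item-total correlation
        stats.append((difficulty, discrimination))
    return stats

item_stats = item_statistics(responses)
for j, (p, r) in enumerate(item_stats):
    print(f"item {j}: difficulty={p:.2f}, discrimination={r}")
```

In this toy matrix, item 0 is answered correctly by every model, so it carries no information for ranking models, and item 3 has a negative discrimination, the kind of signal that would flag an item for review. Neither property can be read off an aggregate accuracy.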

If this is right

Validity failures become diagnosable through properties of specific items rather than hidden in totals.
Evaluation methods can borrow and adapt validation techniques from psychometrics.
Benchmarks can be checked for alignment between measured constructs and intended claims.
Repositories of item-level data will support evidence-centered design of new evaluations.
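
The first point can be made concrete with a toy case. The sketch below assumes two hypothetical models scored on the same six items; the item identifiers and scores are invented for illustration only.

```python
# Hypothetical per-item correctness for two models on the same six items
# (1 = correct, 0 = incorrect). Values are invented for illustration.
model_a = {"q1": 1, "q2": 1, "q3": 1, "q4": 0, "q5": 0, "q6": 0}
model_b = {"q1": 0, "q2": 0, "q3": 0, "q4": 1, "q5": 1, "q6": 1}

def accuracy(scores):
    """Aggregate score: fraction of items answered correctly."""
    return sum(scores.values()) / len(scores)

def disagreements(a, b):
    """Items on which the two models differ, in sorted order."""
    return sorted(item for item in a if a[item] != b[item])

acc_a, acc_b = accuracy(model_a), accuracy(model_b)
differing = disagreements(model_a, model_b)

print(acc_a, acc_b)  # identical aggregate accuracy
print(differing)     # yet the models disagree on every item
```

Both models report the same aggregate accuracy while succeeding on disjoint sets of items, so a leaderboard built on totals alone would treat them as interchangeable.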

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many existing benchmarks may need redesign once item-level weaknesses become visible.
Developers could select or adapt benchmarks more precisely for particular deployment contexts.
The approach would align AI testing more closely with long-standing practices in educational and psychological measurement.

Load-bearing premise

That the main cause of current validity failures is simply the absence of item-level data and that providing it will enable rigorous validation without further changes to benchmark design or metric selection.

What would settle it

A benchmark that achieves clear, documented validity evidence using only aggregate scores without item-level data, or an item-level dataset that still leaves the same design and metric problems unresolved.

Figures

Figures reproduced from arXiv: 2604.03244 by Dongyao Zhu, Han Jiang, Sang T. Truong, Sanmi Koyejo, Susu Zhang, Xiaoyuan Yi, Xing Xie, Yuzhuo Bai, Ziang Xiao.

**Figure 1.** Benchmark-level accuracy distributions for 66 pre–Nov. 2023 models on MMLU and 72 post–Jun. 2024 models on MMLU-Pro. Results are from HELM-Classic and HELM-Capabilities.

**Figure 3.** ICCs for three items in MMLU.

**Figure 4.** Item clusters on BabiQA based on factor loadings.

**Figure 5.** Convergent/discriminant evidence of the four sub-constructs (#1 to #4) on MMLU-Pro.

**Figure 6.** Schema for data entries in OpenEval.

**Figure 7.** Convergent/discriminant evidence of the four sub-constructs (#1 - #5) on MMLU.

**Figure 8.** Clusters from four benchmark datasets in HELM revealed by K-means clustering over item factor loadings from GLRM. Figure title: Item Clusters in GLRM Factor Space. Panels: BabiQA (k=3), MMLU (k=5), MMLU-Pro (k=4), Omni-MATH (k=4).

**Figure 9.** Example items with different maximum factor loadings within the same subject (psychology and physics) in MMLU-Pro.
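
The Figure 8 caption describes K-means clustering over item factor loadings. A minimal sketch of that clustering step follows, assuming the loadings have already been estimated. The two-factor loadings, the `kmeans` helper, and its simple initialization are hypothetical; the GLRM fitting used in the paper is not reproduced here.

```python
import math

# Hypothetical two-factor loadings for six items (invented values).
loadings = [
    (0.90, 0.10), (0.80, 0.20), (0.85, 0.05),
    (0.10, 0.90), (0.20, 0.80), (0.05, 0.95),
]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each item to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members,
        # keeping the old centroid if a cluster happens to be empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

labels, centroids = kmeans(loadings, k=2)
print(labels)
```

Items that load mainly on the same factor end up in the same cluster, which is the sense in which clusters can be read as candidate sub-constructs. A production analysis would more likely use an established library with multiple random restarts rather than this fixed initialization.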

Original abstract

This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Item-level data could sharpen AI evaluations, but the paper needs to prove it works rather than just advocate for it.

The main takeaway is that AI benchmark evaluations need item-level data to move beyond opaque aggregate scores and enable real diagnostic work. Without it, we can't easily see why a model fails on certain items or validate what the benchmark is actually measuring.

The authors make a reasonable case by linking current validity problems to the lack of granular data. They review issues like unjustified design choices and misaligned metrics, then show through examples how item analysis from psychometrics could apply here. The OpenEval repository is a useful addition that provides actual data for others to use and build on. This kind of resource could help standardize practices across the community.

A soft spot is the lack of evidence that item-level data alone would resolve the systemic failures. The paper uses illustrative cases rather than showing a before-and-after or a controlled test where access to item data improves validation. The assumption that this data will lead to principled frameworks might overlook the need for better overall benchmark construction methods. It would be stronger if they had included a small study demonstrating the difference.

This paper targets researchers focused on AI evaluation methodology. Readers who want to improve how we assess models for real-world use will find the discussion relevant. It deserves peer review because the core idea is sound and the repo gives it a practical edge, though revisions could add more concrete demonstrations.

Referee Report

1 major / 2 minor

Summary. This position paper argues that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Current evaluation paradigms suffer from systemic validity failures (unjustified design choices, misaligned metrics) that cannot be addressed without granular diagnostics and principled validation evidence. The manuscript substantiates the position by reviewing failures, contrasting paradigms from psychometrics and computer science, presenting illustrative item-level analyses of properties and latent constructs, and introducing the OpenEval repository to support community adoption of evidence-centered evaluation.

Significance. If the position is adopted, the field would move toward more reliable, diagnostically rich evaluations suitable for high-stakes deployment of generative AI. The OpenEval repository, if populated and maintained, would provide a concrete infrastructure for reproducibility and validity studies, addressing a recognized gap between current benchmark practices and psychometric standards for evidence collection.

major comments (1)

The central claim that item-level data will by itself enable principled validation (Abstract and main argument) rests on illustrative examples rather than a controlled comparison or quantitative demonstration that supplying such data resolves the described validity failures without concurrent changes to benchmark design or metric selection; this assumption is load-bearing for the advocated paradigm shift.

minor comments (2)

The description of OpenEval would benefit from explicit details on data schema, licensing, and mechanisms for community contribution to ensure long-term utility.
A brief table or figure summarizing the specific validity failures discussed and the corresponding item-level diagnostics would improve readability and strengthen the illustrative analyses.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The feedback helps sharpen the scope of our position paper. We respond to the major comment below.

Referee: The central claim that item-level data will by itself enable principled validation (Abstract and main argument) rests on illustrative examples rather than a controlled comparison or quantitative demonstration that supplying such data resolves the described validity failures without concurrent changes to benchmark design or metric selection; this assumption is load-bearing for the advocated paradigm shift.

Authors: We agree that item-level data alone does not automatically resolve validity failures and that concurrent changes to benchmark design and metric selection are typically required. Our position is that item-level data is a necessary enabling condition for such changes to be evidence-based, because aggregate scores inherently obscure the item properties and latent-construct mismatches that must be diagnosed. The illustrative analyses (Sections 4 and 5) are deliberately chosen to show concrete cases in which item-level inspection reveals misalignments invisible at the aggregate level; these cases serve as existence proofs rather than exhaustive validation. As a position paper, our goal is to articulate why the field requires this data infrastructure and to introduce OpenEval as the mechanism for accumulating the quantitative comparisons the referee correctly identifies as ultimately needed. We have revised the abstract and the concluding section to state explicitly that item-level data is necessary but not sufficient, and that the repository is intended to support the controlled studies that will test sufficiency.

revision: partial

Circularity Check

0 steps flagged

Position paper with external literature support; no circular derivation

This is a position paper advocating item-level benchmark data for rigorous AI evaluation science. It describes current validity failures, draws parallels to external psychometrics and CS evaluation paradigms, and presents illustrative analyses plus the OpenEval repository. No formal derivation, equations, or theorems exist; the central claim is explicitly advocative rather than deductive. It references external literature without load-bearing self-citation chains or reductions of predictions to fitted inputs by construction. The argument remains self-contained against external benchmarks and does not reduce any claim to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that validity failures are intractable without item-level data and on the introduction of OpenEval as a supporting resource; no free parameters or new physical entities are postulated.

axioms (1)

domain assumption: Current AI evaluation paradigms exhibit systemic validity failures ranging from unjustified design choices to misaligned metrics that remain intractable without item-level data.
Stated directly in the abstract as the premise requiring the proposed solution.

invented entities (1)

OpenEval (no independent evidence)
purpose: Growing repository of item-level benchmark data to support evidence-centered AI evaluation.
New resource introduced to catalyze adoption; no independent evidence of its contents or impact is provided in the abstract.

pith-pipeline@v0.9.0 · 5445 in / 1318 out tokens · 34685 ms · 2026-05-15T19:13:38.408450+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: item factor analysis (IFA) ... SVD-based ... GLRM ... factor loadings ... K-means clustering

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.