AI Evaluation Should Require Standardized Item-Level Data Releases
Pith reviewed 2026-05-15 19:13 UTC · model grok-4.3
The pith
Item-level AI benchmark data is essential for a rigorous science of AI evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Item-level benchmark data enables the fine-grained diagnostics and principled validation needed to address the unjustified design choices and misaligned metrics that plague current AI evaluations. By examining individual item properties and latent constructs, this data provides insights not available from summary statistics alone. The authors support their position through analysis of existing paradigms and propose a repository of such data to catalyze adoption.
What carries the argument
Item-level benchmark data that records performance on each individual test item rather than aggregate scores, allowing examination of item properties and latent constructs.
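A minimal sketch of the distinction, assuming binary correct/incorrect scoring. The model names and response values are invented for illustration and are not drawn from the paper or from OpenEval; the helper functions are likewise hypothetical.

```python
# Hypothetical item-level responses: one list per model, one entry per item,
# 1 = correct, 0 = incorrect. All values are made up for illustration.
responses = {
    "model_a": [1, 1, 1, 0, 0, 0],
    "model_b": [0, 0, 0, 1, 1, 1],
}


def aggregate_score(item_scores):
    """Fraction of items answered correctly (the usual summary statistic)."""
    return sum(item_scores) / len(item_scores)


def disagreement_items(scores_x, scores_y):
    """Indices of items on which two models received different scores."""
    return [i for i, (x, y) in enumerate(zip(scores_x, scores_y)) if x != y]


agg_a = aggregate_score(responses["model_a"])
agg_b = aggregate_score(responses["model_b"])
differing = disagreement_items(responses["model_a"], responses["model_b"])

print(f"aggregate scores: model_a={agg_a}, model_b={agg_b}")
print(f"items where the models disagree: {differing}")
```

In this toy case the two models tie on the aggregate score yet disagree on every item, a difference that only the item-level record can show.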
If this is right
- Validity failures become diagnosable through properties of specific items rather than hidden in totals.
- Evaluation methods can borrow and adapt validation techniques from psychometrics.
- Benchmarks can be checked for alignment between measured constructs and intended claims.
- Repositories of item-level data will support evidence-centered design of new evaluations.
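The first two implications can be sketched with a standard classical-test-theory diagnostic: the corrected item-rest correlation, which asks whether models that do well on the rest of the benchmark also tend to get a given item right. This is an illustrative heuristic, not necessarily the analysis the paper performs. The response matrix is invented and the zero cutoff for flagging is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def item_diagnostics(matrix):
    """matrix[m][i] is 1 if model m answered item i correctly, else 0."""
    n_models, n_items = len(matrix), len(matrix[0])
    results = []
    for i in range(n_items):
        item_col = [row[i] for row in matrix]
        # "Rest" score: each model's total with the current item excluded.
        rest = [sum(row) - row[i] for row in matrix]
        results.append({
            "item": i,
            "proportion_correct": sum(item_col) / n_models,
            "item_rest_corr": pearson(item_col, rest),
        })
    return results


# Hypothetical response matrix: 5 models (rows) by 4 items (columns).
matrix = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]

diagnostics = item_diagnostics(matrix)
# Assumed rule of thumb: a negative item-rest correlation marks an item
# that otherwise weaker models answer correctly more often than stronger ones.
flagged = [d["item"] for d in diagnostics
           if d["item_rest_corr"] is not None and d["item_rest_corr"] < 0]
print(flagged)
```

A flagged item is a candidate for manual review (for example, a mislabeled answer key or an ambiguous prompt), not proof of a defect. None of this can be computed from aggregate scores alone.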
Where Pith is reading between the lines
- Many existing benchmarks may need redesign once item-level weaknesses become visible.
- Developers could select or adapt benchmarks more precisely for particular deployment contexts.
- The approach would align AI testing more closely with long-standing practices in educational and psychological measurement.
Load-bearing premise
That the main cause of current validity failures is simply the absence of item-level data and that providing it will enable rigorous validation without further changes to benchmark design or metric selection.
What would settle it
A benchmark that achieves clear, documented validity evidence using only aggregate scores without item-level data, or an item-level dataset that still leaves the same design and metric problems unresolved.
Original abstract
This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Current evaluation paradigms suffer from systemic validity failures (unjustified design choices, misaligned metrics) that cannot be addressed without granular diagnostics and principled validation evidence. The manuscript substantiates the position by reviewing failures, contrasting paradigms from psychometrics and computer science, presenting illustrative item-level analyses of properties and latent constructs, and introducing the OpenEval repository to support community adoption of evidence-centered evaluation.
Significance. If the position is adopted, the field would move toward more reliable, diagnostically rich evaluations suitable for high-stakes deployment of generative AI. The OpenEval repository, if populated and maintained, would provide a concrete infrastructure for reproducibility and validity studies, addressing a recognized gap between current benchmark practices and psychometric standards for evidence collection.
major comments (1)
- The central claim that item-level data will by itself enable principled validation (Abstract and main argument) rests on illustrative examples rather than a controlled comparison or quantitative demonstration that supplying such data resolves the described validity failures without concurrent changes to benchmark design or metric selection; this assumption is load-bearing for the advocated paradigm shift.
minor comments (2)
- The description of OpenEval would benefit from explicit details on data schema, licensing, and mechanisms for community contribution to ensure long-term utility.
- A brief table or figure summarizing the specific validity failures discussed and the corresponding item-level diagnostics would improve readability and strengthen the illustrative analyses.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The feedback helps sharpen the scope of our position paper. We respond to the major comment below.
- Referee: The central claim that item-level data will by itself enable principled validation (Abstract and main argument) rests on illustrative examples rather than a controlled comparison or quantitative demonstration that supplying such data resolves the described validity failures without concurrent changes to benchmark design or metric selection; this assumption is load-bearing for the advocated paradigm shift.
  Authors: We agree that item-level data alone does not automatically resolve validity failures and that concurrent changes to benchmark design and metric selection are typically required. Our position is that item-level data is a necessary enabling condition for such changes to be evidence-based, because aggregate scores inherently obscure the item properties and latent-construct mismatches that must be diagnosed. The illustrative analyses (Sections 4 and 5) are deliberately chosen to show concrete cases in which item-level inspection reveals misalignments invisible at the aggregate level; these cases serve as existence proofs rather than exhaustive validation. As a position paper, our goal is to articulate why the field requires this data infrastructure and to introduce OpenEval as the mechanism for accumulating the quantitative comparisons the referee correctly identifies as ultimately needed. We have revised the abstract and the concluding section to state explicitly that item-level data is necessary but not sufficient, and that the repository is intended to support the controlled studies that will test sufficiency.
  Revision: partial
Circularity Check
Position paper with external literature support; no circular derivation
This is a position paper advocating item-level benchmark data for rigorous AI evaluation science. It describes current validity failures, draws parallels to external psychometrics and CS evaluation paradigms, and presents illustrative analyses plus the OpenEval repository. No formal derivation, equations, or theorems exist; the central claim is explicitly advocative rather than deductive. It references external literature without load-bearing self-citation chains or reductions of predictions to fitted inputs by construction. The argument remains self-contained against external benchmarks and does not reduce any claim to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Current AI evaluation paradigms exhibit systemic validity failures ranging from unjustified design choices to misaligned metrics that remain intractable without item-level data.
invented entities (1)
- OpenEval: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: item factor analysis (IFA) ... SVD-based ... GLRM ... factor loadings ... K-means clustering
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.