
arxiv: 2604.03244 · v2 · pith:H32PD7AZnew · submitted 2026-02-27 · 💻 cs.AI · cs.CY · cs.DB

AI Evaluation Should Require Standardized Item-Level Data Releases

Pith reviewed 2026-05-25 06:44 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.DB
keywords AI evaluation · benchmark validity · item-level data · construct alignment · replicability · auditability · standardized infrastructure · capability assessment

The pith

Standardized release of item-level benchmark responses must become default infrastructure for AI evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI evaluations currently rest on aggregate scores that cannot be checked for validity because they omit responses to individual test items. Without those details, it is impossible to determine whether items were well chosen, whether they align with the claimed construct, or whether results generalize beyond the specific sample. Releasing item-level data under a common schema would supply the evidence needed to audit evaluations, identify weak items, and document misalignment between what a benchmark intends to measure and what it actually measures. This change would replace unverified claims about model capabilities with replicable, auditable records.
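The point that aggregate scores can hide what individual items reveal can be illustrated with a toy case. The sketch below is not from the paper: the model names and scores are hypothetical, and it only shows that two models can tie on accuracy while agreeing on no item at all.

```python
# Hypothetical scored responses (1 = correct, 0 = incorrect) on the same six items.
responses = {
    "model_a": [1, 1, 1, 0, 0, 0],
    "model_b": [0, 0, 0, 1, 1, 1],
}


def accuracy(scores):
    """Aggregate score: fraction of items answered correctly."""
    return sum(scores) / len(scores)


def agreement(a, b):
    """Fraction of items on which two models receive the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


acc_a = accuracy(responses["model_a"])
acc_b = accuracy(responses["model_b"])
item_agreement = agreement(responses["model_a"], responses["model_b"])

print(acc_a, acc_b, item_agreement)  # 0.5 0.5 0.0
```

A leaderboard reporting only the two accuracies would rank these models as equivalent; the per-item records are what expose that they succeed on disjoint parts of the benchmark.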

Core claim

Designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure because it enables transparency, replicability, and auditability while addressing root causes of underspecified item selection, construct misalignment, and poor generalization.

What carries the argument

Item-level model responses to benchmark items released under a unified schema, which supplies the raw observations needed to assess whether aggregate scores reflect genuine capability rather than artifacts of item choice.

If this is right

  • Validity claims about what a benchmark measures can be tested directly against observed response patterns.
  • Low-quality or misaligned items can be identified and removed from future use.
  • Research priorities can shift from improving aggregate numbers to correcting documented weaknesses in evaluation design.
  • Trust in deployed systems can be conditioned on the existence of verifiable item-level performance records.
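As a rough illustration of how weak items might be screened once item-level responses exist, the sketch below computes two classical test theory statistics per item: the proportion of models answering correctly, and a corrected item-total correlation (the item score against the total of the remaining items). The response matrix, the function names, and the flagging rule are assumptions made for this example, not the paper's procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def item_statistics(matrix):
    """matrix[m][i] is 1 if model m answered item i correctly, else 0."""
    n_models = len(matrix)
    stats = []
    for i in range(len(matrix[0])):
        item = [row[i] for row in matrix]
        # Rest score excludes the item itself so it cannot inflate its own correlation.
        rest = [sum(row) - row[i] for row in matrix]
        stats.append({
            "item": i,
            "p_correct": sum(item) / n_models,
            "discrimination": pearson(item, rest),
        })
    return stats


def is_suspect(stat):
    """Hypothetical rule: flag saturated items and non-positive discrimination."""
    d = stat["discrimination"]
    return stat["p_correct"] in (0.0, 1.0) or d is None or d <= 0


# Hypothetical data: 6 models (rows, strongest first) x 5 items (columns).
matrix = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
stats = item_statistics(matrix)
flagged = [s["item"] for s in stats if is_suspect(s)]
print(flagged)
```

In this toy matrix, item 0 is answered correctly by every model and so carries no information, while item 4 is answered correctly only by the weaker models; neither problem is visible in the models' aggregate scores.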

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark maintainers might adopt iterative revision cycles where item analysis directly informs updates to the test set.
  • Cross-benchmark comparisons could become more informative once response patterns rather than single scores are available.
  • Evaluation practices in adjacent fields that rely on human testing might incorporate similar item-level transparency requirements.
  • Model development incentives could move away from optimizing for known aggregate metrics toward performance that holds up under item-level scrutiny.

Load-bearing premise

That the practical costs of releasing item-level data, such as contamination risk and author effort, remain smaller than the costs of decisions based on uncheckable aggregate scores.

What would settle it

A case in which releasing item-level responses produces no additional evidence of item quality problems or construct misalignment beyond what aggregate scores already reveal.

Figures

Figures reproduced from arXiv: 2604.03244 by Dongyao Zhu, Han Jiang, Sang T. Truong, Sanmi Koyejo, Susu Zhang, Xiaoyuan Yi, Xing Xie, Yuzhuo Bai, Ziang Xiao.

Figure 1. Benchmark-level accuracy distributions for 66 pre–Nov. 2023 models on MMLU and 72 post–Jun. 2024 models on MMLU-Pro. Results are from HELM-Classic and HELM-Capabilities.
… can lead to unfair evaluations, which are nearly impossible to detect at benchmark level without explicit reporting by developers (Zhang et al., 2025). These issues are difficult to diagnose and address without item-level details. As shown…
Figure 3. ICCs for three items in MMLU.
5. Empirical Illustrations. To illustrate the unique insights enabled by item-level benchmark data, we leverage item-level resources from HELM-Classic (v0.3.0) and HELM-Capabilities (Liang et al., 2023) to examine item characteristics and benchmark sub-constructs decomposition. 5.1. Item Characteristics from CTT. An item’s statistical characteristics such as difficulty and discr…
Figure 4. Item clusters on BabiQA based on factor loadings.
… orange observations on the left indicates that a substantial proportion of MMLU-Pro items have very low difficulty. In other words, many items are no longer challenging for the 72 post-June 2024 models, suggesting fast benchmark saturation. (2) Compared to MMLU, item quality substantially improved on the MMLU-Pro with much fewer items with low or negative …
Figure 5. Convergent/discriminant evidence of the four sub-constructs (#1 - #4) on MMLU-Pro.
… the OpenLLM Leaderboard v2 (Fourrier et al., 2024). We have been (1) collecting evaluation results to reduce the sparsity of the dataset-model matrix, and (2) incorporating external and interdisciplinary datasets. OpenEval now covers over 225k items from 64 benchmark datasets, with the number of evaluated models per dataset…
Figure 6. Schema for data entries in OpenEval.
… AI learning trajectories across samples with varying properties, informing decisions about training data composition, training paradigms, and the choice of proxy tasks and evaluation metrics. Moreover, item-level data supports a shift toward data-driven research paradigms in many machine learning subfields (Xu et al., 2024), including statistical learning, generaliz…
Figure 7. Convergent/discriminant evidence of the four sub-constructs (#1 - #5) on MMLU.
Figure 8. Clusters from four benchmark datasets in HELM revealed by K-means clustering over item factor loadings from GLRM. Panel titles: BabiQA (k=3), MMLU (k=5), MMLU-Pro (k=4), Omni-MATH (k=4); overall title: Item Clusters in GLRM Factor Space.
Figure 9. Example items with different maximum factor loadings within the same subject (psychology and physics) in MMLU-Pro.
Original abstract

This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper argues that standardized item-level benchmark data releases should become default infrastructure for AI evaluation. It claims that reliance on aggregate model scores causes underspecified item selection, construct misalignment, and poor generalization because validity cannot be assessed without item responses; this leads to inflated capability claims, misdirected research, and unwarranted trust in systems. The authors support the position by constructing OpenEval—an archive of 10M responses across 155k items from widely-used benchmarks under a unified schema—and demonstrate its use for identifying low-quality items, documenting misalignment, and recovering evidence on benchmarks' internal structure. They address objections on contamination risk and author burden as tractable relative to the costs of untrustworthy claims.

Significance. If adopted, the position would improve the trustworthiness of AI evaluations by enabling empirical validity assessment, transparency, replicability, and auditability. The construction of OpenEval provides a concrete, community-extensible example of feasibility and directly illustrates diagnostic uses of item-level data, which is a strength for a position paper.

major comments (2)
  1. [Position statement and root-cause analysis] The central claim that aggregate scores are the root cause of the listed failures (underspecified selection, misalignment, poor generalization) is presented as self-evident in the position statement but lacks a systematic mapping or quantitative illustration showing how item-level data would have prevented each failure across the benchmarks included in OpenEval.
  2. [OpenEval construction and demonstrations] In the demonstrations, the recovery of validity evidence about internal structure is shown via the unified schema, but the paper does not report whether the 155k items were selected representatively or whether the schema itself introduces new alignment artifacts that could affect the claimed diagnostic power.
minor comments (2)
  1. [Abstract and § on OpenEval] The abstract states the scale of OpenEval (10M responses, 155k items) but the main text should include an early table or figure summarizing the source benchmarks and response counts for quick reference.
  2. [Methods / schema definition] Notation for the unified schema (e.g., fields for item ID, model response, gold label) should be defined explicitly with an example row to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments highlight opportunities to strengthen the explicit linkage between our position and the OpenEval demonstrations. We address each point below and will incorporate the suggested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Position statement and root-cause analysis] The central claim that aggregate scores are the root cause of the listed failures (underspecified selection, misalignment, poor generalization) is presented as self-evident in the position statement but lacks a systematic mapping or quantitative illustration showing how item-level data would have prevented each failure across the benchmarks included in OpenEval.

    Authors: We agree that an explicit mapping would make the causal argument more transparent. In the revision we will add a new table (and accompanying text) that, for each of the three failures, lists (a) a concrete example drawn from one of the OpenEval benchmarks, (b) the diagnostic that becomes possible only with item-level responses, and (c) the quantitative evidence (e.g., item-difficulty variance or response-pattern correlations) that aggregate scores alone cannot supply. This addition directly addresses the request for systematic illustration without altering the position paper’s core thesis. revision: yes

  2. Referee: [OpenEval construction and demonstrations] In the demonstrations, the recovery of validity evidence about internal structure is shown via the unified schema, but the paper does not report whether the 155k items were selected representatively or whether the schema itself introduces new alignment artifacts that could affect the claimed diagnostic power.

    Authors: The 155k items comprise the complete item sets of the source benchmarks (MMLU, GSM8K, HumanEval, etc.) rather than a subsample; we will state this explicitly and report the per-benchmark coverage percentages. The schema is deliberately minimal (prompt, model response, binary/continuous score, and provenance metadata) and mirrors fields already present in the original releases; we will add a short subsection discussing this design rationale and noting that the public release of both raw and schema-mapped data allows any schema-induced artifacts to be audited or removed by downstream users. revision: yes
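To make the field list in this response concrete, here is a minimal sketch of what one row under such a schema might look like. Every field name, the `ItemResponse` class, the example values, and the [0, 1] score range are assumptions for illustration; the actual OpenEval schema is defined in the paper (Figure 6) and may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ItemResponse:
    """One model response to one benchmark item (hypothetical field names)."""
    benchmark: str
    item_id: str
    model: str
    prompt: str
    response: str
    score: float  # binary (0/1) or continuous
    provenance: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumption for this sketch only: scores are normalized to [0, 1].
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")


# Hypothetical example row; none of these values come from the archive.
row = ItemResponse(
    benchmark="example-benchmark",
    item_id="item-0001",
    model="example-model",
    prompt="What is 2 + 2?",
    response="4",
    score=1.0,
    provenance={"source": "hypothetical-run", "harness_version": "0.0"},
)

# Serialize to one JSON line and read it back to check the round trip.
line = json.dumps(asdict(row), sort_keys=True)
restored = ItemResponse(**json.loads(line))
print(line)
```

Keeping provenance as a free-form mapping is one way to carry source-specific metadata without widening the core schema, which is in the spirit of the "deliberately minimal" design described above.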

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper is a position paper whose central claim—that item-level data releases are required for assessing evaluation validity—rests on explicit logical reasoning about construct misalignment and generalization failures, backed by the independent construction of OpenEval (10M responses, 155k items) as a concrete, publicly described archive. No equations, fitted parameters, or predictions appear; the demonstration of uses (identifying low-quality items, documenting misalignment) is performed on the constructed data rather than reducing to any input by definition. No self-citation chain is load-bearing for the necessity argument, and objections (contamination, burden) are addressed directly without invoking prior author work as an unverified uniqueness theorem. The argument is therefore self-contained against external benchmarks of validity evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that validity assessment in AI benchmarks requires item-level empirical evidence, which is drawn from psychometric principles rather than derived within the paper.

axioms (1)
  • domain assumption: Validity claims about benchmarks cannot be assessed without item-level model response data.
    Stated directly in the abstract as the root cause of current evaluation failures.
invented entities (1)
  • OpenEval (no independent evidence)
    purpose: Unified item-level archive of benchmark responses to enable validity analysis.
    New archive constructed by the authors containing 10M responses across 155k items.

pith-pipeline@v0.9.0 · 5765 in / 1205 out tokens · 21360 ms · 2026-05-25T06:44:19.936687+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

  1. [1]

    acl-main.485/

    URL https://aclanthology.org/2020.acl-main.485/. Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and th...

  2. [2]

    naacl-main.385/

    URL https://aclanthology.org/2021.naacl-main.385/. Campbell, D. T. and Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2):81–105, 1959. ISSN 0033-2909 (Print), 1939-1455 (Electronic). doi: 10.1037/h0046016. URL https://doi.org/10.1037/h0046016. Chiang, W.-L., Zheng, L., Sheng, Y., Ange...

  3. [3]

    emnlp-main.699/

    URL https://aclanthology.org/2023.emnlp-main.699/. Cook, L. L. and Pitoniak, M. J. (eds.). Educational Measurement. Oxford University Press, 5 edition,

  4. [4]

    URL https://global.oup.com/academic/product/educational-measurement-9780197654965

    ISBN 978-0-19-765496-5. URL https://global.oup.com/academic/product/educational-measurement-9780197654965. Cronbach, L. J. and Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 9 Position: Science of AI Evaluation Requires Item-level Benchmark Data

  5. [5]

    doi: https://doi.org/10.1037/h0040957

    ISSN 1939-1455 (Electronic); 0033-2909 (Print). doi: https://doi.org/10.1037/h0040957. 60 references. (PsycInfo Database Record (c) 2025 APA, all rights reserved). Dehghani, M., Tay, Y., Gritsenko, A. A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D., and Vinyals, O. The benchmark lottery, 2021. URL https://arxiv.org/abs/2107.07002. Deveci, İ. E. an...

  6. [6]

    Dongarra, J

    URL https://openreview.net/forum?id=0zDiyIGCFT. Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. LINPACK Users’ Guide. Society for Industrial and Applied Mathematics, 1979. doi: 10.1137/1.9781611971811. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611971811. Du, M., Manjunatha, V., Jain, R., Deshpande, R., Dernoncourt, F., Gu, J....

  7. [7]

    findings-emnlp.301/

    URL https://aclanthology.org/2020.findings-emnlp.301/. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., and Jacobsen, H.-A. Bigbench: towards an industry standard benchmark for big data analytics. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pp. 1197–1208, New York, NY, USA, 2013. As...

  8. [8]

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J

    URL https://openreview.net/forum?id=sAFottNlra. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ. Henrysson, S. Correction of item-total correlations in ...

  9. [9]

    Jacovi, A., Caciularu, A., Goldman, O., and Goldberg, Y

    URL https://openreview.net/forum?id=R0c2qtalgG. Jacovi, A., Caciularu, A., Goldman, O., and Goldberg, Y. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Pr...

  10. [10]

    emnlp-main.308/

    URL https://aclanthology.org/2023.emnlp-main.308/. Jiang, H., Yi, X., Wei, Z., Xiao, Z., Wang, S., and Xie, X. Raising the bar: Investigating the values of large language models via generative evolving testing. In Forty-second International Conference on Machine Learning, 2025a. URL https://openreview.net/forum?id=0REM9ydeLZ. Jiang, X., Chang, D., and X...

  11. [11]

    URL https://aclanthology.org/2025.bea-1.69/

    doi: 10.18653/v1/2025.bea-1.69. URL https://aclanthology.org/2025.bea-1.69/. Le Bras, R., Swayamdipta, S., Bhagavatula, C., Zellers, R., Peters, M. E., Sabharwal, A., and Choi, Y. Adversarial filters of dataset biases. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020. Li, F., Hogg, D. C., and Cohn, A. G. ...

  12. [12]

    Featured Certification, Expert Certification, Outstanding Certification

    URL https://openreview.net/forum?id=iO4LZibEqW. Featured Certification, Expert Certification, Outstanding Certification. Liao, Q. V. and Xiao, Z. Rethinking model evaluation as narrowing the socio-technical gap, 2025. URL https://arxiv.org/abs/2306.03100. Lin, B. Y., Deng, Y., Chandu, K., Brahman, F., Ravichander, A., Pyatkin, V., Dziri, N., Bra...

  13. [13]

    eacl-main.148/

    URL https://aclanthology.org/2023.eacl-main.148/. Nahum, O., Calderon, N., Keller, O., Szpektor, I., and Reichart, R. Are LLMs better than reported? detecting label errors and mitigating their effect on model performance. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Metho...

  14. [14]

    Stolfo, A., Balachandran, V., Yousefi, S., Horvitz, E., and Nushi, B

    Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main

  15. [15]

    GPT-4 Technical Report

    URL https://aclanthology.org/2025.emnlp-main.1360/. Ni, J., Xue, F., Yue, X., Deng, Y., Shah, M., Jain, K., Neubig, G., and You, Y. Mixeval: Deriving wisdom of the crowd from LLM benchmark mixtures. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=6A29LUZhfv. Nie, Y., Williams, ...

  16. [16]

    naacl-main.200/

    URL https://aclanthology.org/2021.naacl-main.200/. Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. tinybenchmarks: evaluating llms with fewer examples. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. Raji, D., Denton, E., Bender, E. M., Hanna, A., and Paullada, A. Ai and the everyt...

  17. [17]

    Salaudeen, O

    URL https://openreview.net/forum?id=hcOq2buakM. Salaudeen, O. E., Reuel, A., Ahmed, A. M., Bedi, S., Robertson, Z., Sundar, S., Domingue, B. W., Wang, A., and Koyejo, S. Measurement to meaning: A validity-centered framework for AI evaluation. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, ...

  18. [18]

    URL https://openreview.net/forum?id=Rxd2TpV6Eg. Son, H. S. Validity evaluation for the data used for artificial intelligence system. In Bi, Y., Bhatia, R., and Kapoor, S. (eds.), Intelligent Systems and Applications, pp. 362–369, Cham, 2020. Springer International Publishing. ISBN 978-3-030-29516-5. Sühr, T., Dorner, F. E., Salaudeen, O., Kelava, A., an...

  19. [19]

    URL https://doi.org/10.1037/0022-3514.84.3.608

    doi: 10.1037/0022-3514.84.3.608. URL https://doi.org/10.1037/0022-3514.84.3.608. White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., Shwartz-Ziv, R., Jain, N., Saifullah, K., Dey, S., Shubh-Agrawal, Sandha, S. S., Naidu, S. V., Hegde, C., LeCun, Y., Goldstein, T., Neiswanger, W., and Goldblum, M. Livebench: A challenging, contamination-li...

  20. [20]

    emnlp-main.1173/

    URL https://aclanthology.org/2025.emnlp-main.1173/. Xu, X., Wu, Z., Qiao, R., Verma, A., Shu, Y., Wang, J., Niu, X., He, Z., Chen, J., Zhou, Z., Lau, G. K. R., Dao, H., Agussurja, L., Sim, R. H. L., Lin, X., Hu, W., Dai, Z., Koh, P. W., and Low, B. K. H. Position paper: Data-cent...

  21. [21]

    findings-emnlp.695/

    URL https://aclanthology.org/2024.findings-emnlp.695/. Xu, Z., Xie, S., Lv, Q., Xiao, S., Song, L., Wenjuan, S., and Lin, F. Diagnosing failures in large language models’ answers: Integrating error attribution into evaluation framework. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Ling...

  22. [22]

    A Survey of Large Language Models

    URL https://aclanthology.org/2025.findings-acl.1089/. Yao, J., Jin, P., Bao, K., Yu, Q., Bhardwaj, K., Su, C., Wang, J., ZHU, Y., Devare, S., Mosk-Aoyama, D., Dong, Z., Srinivasan, V. K., Zhang, Y., Kuchaiev, O., Jiao, J., and Zhu, B. The measure of all measures: Quantifying LLM benchmark quality. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM...