AI Evaluation Should Require Standardized Item-Level Data Releases
Pith reviewed 2026-05-25 06:44 UTC · model grok-4.3
The pith
Standardized release of item-level benchmark responses must become default infrastructure for AI evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure because it enables transparency, replicability, and auditability while addressing root causes of underspecified item selection, construct misalignment, and poor generalization.
What carries the argument
Item-level model responses to benchmark items released under a unified schema, which supplies the raw observations needed to assess whether aggregate scores reflect genuine capability rather than artifacts of item choice.
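A minimal sketch of why aggregate scores alone can mislead, assuming binary per-item scores. The two score vectors below are invented for illustration and do not come from OpenEval; the helper names are hypothetical.

```python
# Hypothetical item-level scores (1 = correct, 0 = incorrect) for two models
# on the same ten items. The values are invented for illustration only.
model_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
model_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def accuracy(scores):
    """Aggregate score: fraction of items answered correctly."""
    return sum(scores) / len(scores)


def agreement(a, b):
    """Fraction of items on which two models receive the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(accuracy(model_a), accuracy(model_b))  # the aggregates are identical
print(agreement(model_a, model_b))           # the item-level patterns never match
```

Both models report the same accuracy, yet they succeed on disjoint sets of items, a difference that only the item-level record exposes.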
If this is right
- Validity claims about what a benchmark measures can be tested directly against observed response patterns.
- Low-quality or misaligned items can be identified and removed from future use.
- Research priorities can shift from improving aggregate numbers to correcting documented weaknesses in evaluation design.
- Trust in deployed systems can be conditioned on the existence of verifiable item-level performance records.
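One way the item-screening implication could be made concrete is sketched below, assuming binary scores arranged as a models-by-items matrix. It uses two classical test theory statistics, item difficulty (proportion correct) and the corrected item-total correlation, purely as an illustration; this is not presented as the paper's own procedure, and the response matrix is invented.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns None when either input is constant."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return None
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_statistics(responses):
    """responses[m][i] is the 0/1 score of model m on item i."""
    stats = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        # Total score excluding item i, so the item is not correlated with itself.
        rest = [sum(row) - row[i] for row in responses]
        stats.append({
            "item": i,
            "difficulty": mean(item),            # proportion of models correct
            "discrimination": pearson(item, rest),
        })
    return stats


def flag_items(stats, min_discrimination=0.0):
    """Flag items that carry no signal or that favor weaker models."""
    return [
        s["item"] for s in stats
        if s["discrimination"] is None or s["discrimination"] < min_discrimination
    ]


# Invented data: six models (rows, strongest first) and seven items (columns).
# Item 0 is answered correctly by every model; item 6 is answered correctly
# only by the three weakest models.
responses = [
    [1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
]
stats = item_statistics(responses)
flagged = flag_items(stats)
print(flagged)
```

The threshold of 0.0 is an arbitrary choice for the sketch; in practice a flagged item would be a candidate for manual review rather than automatic removal.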
Where Pith is reading between the lines
- Benchmark maintainers might adopt iterative revision cycles where item analysis directly informs updates to the test set.
- Cross-benchmark comparisons could become more informative once response patterns rather than single scores are available.
- Evaluation practices in adjacent fields that rely on human testing might incorporate similar item-level transparency requirements.
- Model development incentives could move away from optimizing for known aggregate metrics toward performance that holds up under item-level scrutiny.
Load-bearing premise
That the practical costs of releasing item-level data, such as contamination risk and author effort, remain smaller than the costs of decisions based on uncheckable aggregate scores.
What would settle it
A case in which releasing item-level responses produces no additional evidence of item quality problems or construct misalignment beyond what aggregate scores already reveal.
Original abstract
This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that standardized item-level benchmark data releases should become default infrastructure for AI evaluation. It claims that reliance on aggregate model scores causes underspecified item selection, construct misalignment, and poor generalization because validity cannot be assessed without item responses; this leads to inflated capability claims, misdirected research, and unwarranted trust in systems. The authors support the position by constructing OpenEval—an archive of 10M responses across 155k items from widely-used benchmarks under a unified schema—and demonstrate its use for identifying low-quality items, documenting misalignment, and recovering evidence on benchmarks' internal structure. They address objections on contamination risk and author burden as tractable relative to the costs of untrustworthy claims.
Significance. If adopted, the position would improve the trustworthiness of AI evaluations by enabling empirical validity assessment, transparency, replicability, and auditability. The construction of OpenEval provides a concrete, community-extensible example of feasibility and directly illustrates diagnostic uses of item-level data, which is a strength for a position paper.
major comments (2)
- [Position statement and root-cause analysis] The central claim that aggregate scores are the root cause of the listed failures (underspecified selection, misalignment, poor generalization) is presented as self-evident in the position statement but lacks a systematic mapping or quantitative illustration showing how item-level data would have prevented each failure across the benchmarks included in OpenEval.
- [OpenEval construction and demonstrations] In the demonstrations, the recovery of validity evidence about internal structure is shown via the unified schema, but the paper does not report whether the 155k items were selected representatively or whether the schema itself introduces new alignment artifacts that could affect the claimed diagnostic power.
minor comments (2)
- [Abstract and § on OpenEval] The abstract states the scale of OpenEval (10M responses, 155k items) but the main text should include an early table or figure summarizing the source benchmarks and response counts for quick reference.
- [Methods / schema definition] Notation for the unified schema (e.g., fields for item ID, model response, gold label) should be defined explicitly with an example row to aid reproducibility.
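To illustrate the kind of example row the second minor comment asks for, the sketch below defines one record and serializes it as a JSON Lines entry. Every field name and value is an assumption chosen to match the fields named in the comment (item ID, model response, gold label); none of it is taken from OpenEval's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ItemResponse:
    # All field names are hypothetical, not OpenEval's actual schema.
    benchmark: str
    item_id: str
    prompt: str
    gold_label: str
    model: str
    model_response: str
    score: float  # binary (0.0 or 1.0) or continuous


# An invented example row.
row = ItemResponse(
    benchmark="example-benchmark",
    item_id="example-benchmark/0001",
    prompt="What is 2 + 2?",
    gold_label="4",
    model="example-model",
    model_response="4",
    score=1.0,
)

line = json.dumps(asdict(row), sort_keys=True)  # one JSON Lines record
restored = ItemResponse(**json.loads(line))     # parse the record back
print(line)
```

A flat record of this kind keeps each response independently readable, which is one plausible way to make a release auditable row by row.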
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments highlight opportunities to strengthen the explicit linkage between our position and the OpenEval demonstrations. We address each point below and will incorporate the suggested clarifications in the revised manuscript.
Point-by-point responses
- Referee: [Position statement and root-cause analysis] The central claim that aggregate scores are the root cause of the listed failures (underspecified selection, misalignment, poor generalization) is presented as self-evident in the position statement but lacks a systematic mapping or quantitative illustration showing how item-level data would have prevented each failure across the benchmarks included in OpenEval.
  Authors: We agree that an explicit mapping would make the causal argument more transparent. In the revision we will add a new table (and accompanying text) that, for each of the three failures, lists (a) a concrete example drawn from one of the OpenEval benchmarks, (b) the diagnostic that becomes possible only with item-level responses, and (c) the quantitative evidence (e.g., item-difficulty variance or response-pattern correlations) that aggregate scores alone cannot supply. This addition directly addresses the request for systematic illustration without altering the position paper’s core thesis.
  revision: yes
- Referee: [OpenEval construction and demonstrations] In the demonstrations, the recovery of validity evidence about internal structure is shown via the unified schema, but the paper does not report whether the 155k items were selected representatively or whether the schema itself introduces new alignment artifacts that could affect the claimed diagnostic power.
  Authors: The 155k items comprise the complete item sets of the source benchmarks (MMLU, GSM8K, HumanEval, etc.) rather than a subsample; we will state this explicitly and report the per-benchmark coverage percentages. The schema is deliberately minimal (prompt, model response, binary/continuous score, and provenance metadata) and mirrors fields already present in the original releases; we will add a short subsection discussing this design rationale and noting that the public release of both raw and schema-mapped data allows any schema-induced artifacts to be audited or removed by downstream users.
  revision: yes
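The per-benchmark coverage percentages mentioned in the second response could be computed along the following lines. This is a sketch under stated assumptions: the function name, the input shapes, and the benchmark and item identifiers are all hypothetical, and the numbers are invented rather than drawn from OpenEval.

```python
from collections import defaultdict


def coverage_by_benchmark(source_item_ids, archived_rows):
    """Percent of each benchmark's source items that appear in the archive.

    source_item_ids: dict mapping benchmark name -> set of item ids in the
        original release.
    archived_rows: iterable of (benchmark, item_id) pairs, one per archived
        response; repeated items (one row per model) are counted once.
    """
    archived = defaultdict(set)
    for benchmark, item_id in archived_rows:
        archived[benchmark].add(item_id)
    return {
        benchmark: 100.0 * len(ids & archived[benchmark]) / len(ids)
        for benchmark, ids in source_item_ids.items()
        if ids  # skip benchmarks with no source items to avoid dividing by zero
    }


# Invented example: bench_a is missing one of its four items in the archive.
source = {"bench_a": {"a1", "a2", "a3", "a4"}, "bench_b": {"b1", "b2"}}
rows = [
    ("bench_a", "a1"), ("bench_a", "a2"), ("bench_a", "a2"), ("bench_a", "a3"),
    ("bench_b", "b1"), ("bench_b", "b2"),
]
coverage = coverage_by_benchmark(source, rows)
print(coverage)
```

A value below 100 for any benchmark would indicate that the archive holds a subsample of that benchmark, which is the situation the referee's question about representativeness is probing.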
Circularity Check
No significant circularity identified
full rationale
The paper is a position paper whose central claim, that item-level data releases are required for assessing evaluation validity, rests on explicit reasoning about construct misalignment and generalization failures, backed by the independent construction of OpenEval (10M responses, 155k items) as a concrete, publicly described archive. No equations, fitted parameters, or predictions appear; the demonstrated uses (identifying low-quality items, documenting misalignment) are carried out on the constructed data and do not reduce to their inputs by definition. No self-citation chain is load-bearing for the necessity argument, and the objections (contamination, burden) are addressed directly without invoking prior work by the authors as an unverified uniqueness theorem. The argument is therefore self-contained with respect to external benchmarks of validity evidence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Validity claims about benchmarks cannot be assessed without item-level model response data.
invented entities (1)
- OpenEval: no independent evidence