pith. machine review for the scientific record

arxiv: 2605.14164 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: no theorem link

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords: AI benchmarks · model evaluation · benchmark selection · foundation models · generative AI · evaluation practices · industry analysis

The pith

AI builders select benchmarks to fit marketing narratives rather than enable consistent scientific comparison.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects 231 benchmarks highlighted in 139 model releases from 11 major AI builders in 2025 and maps their claimed purposes. It finds that 63.2 percent of these benchmarks appear for only one builder and 38.5 percent appear in only one release, producing little overlap for direct comparison across models. The authors create a taxonomy that translates each builder's wording into a shared set of measured signals, showing that categories such as general knowledge application are applied to narrow STEM tasks yet presented as evidence of broader progress. The central argument is that companies treat benchmark results as flexible storytelling devices to support competitive positioning instead of as fixed instruments for reproducible evaluation.
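
The two headline percentages are simple counts over (benchmark, builder, release) triples. Below is a minimal sketch of how one might recompute them from the released dataset, assuming a hypothetical long-format table with columns benchmark, builder, and release_id; the actual Benchmarking-Cultures-25 schema may differ.

    # Sketch: recompute the fragmentation statistics described above.
    # Assumes one row per (benchmark, builder, release) highlight; column
    # names are illustrative, not the dataset's documented schema.
    import pandas as pd

    def fragmentation_stats(df: pd.DataFrame) -> dict:
        """Share of benchmarks tied to a single builder / a single release."""
        per_benchmark = df.groupby("benchmark").agg(
            n_builders=("builder", "nunique"),
            n_releases=("release_id", "nunique"),
        )
        return {
            "n_benchmarks": len(per_benchmark),
            "single_builder_share": float((per_benchmark["n_builders"] == 1).mean()),
            "single_release_share": float((per_benchmark["n_releases"] == 1).mean()),
        }

    # Toy data: one shared benchmark and two builder-specific benchmarks.
    toy = pd.DataFrame({
        "benchmark":  ["GPQA Diamond", "GPQA Diamond", "AIME 2025", "InternalEval"],
        "builder":    ["A", "B", "A", "B"],
        "release_id": ["A-r1", "B-r1", "A-r2", "B-r1"],
    })
    print(fragmentation_stats(toy))
    # On the full dataset the two shares should land near 0.632 and 0.385.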

Core claim

Highlighted benchmarks in 2025 model releases form a fragmented set where most tests are used by a single builder, the same test receives different competency attributions from different companies, and many results are framed as indicators of AGI progress even when the underlying tasks remain limited to specific STEM domains.

What carries the argument

The Benchmarking-Cultures-25 dataset of 231 highlighted benchmarks paired with a unified taxonomy that converts each builder's terminology into common categories of measured signals.
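
At its core, the taxonomy is a many-to-one relabeling from each builder's competency wording to a shared category set. The sketch below illustrates that normalization step; the target categories ("coding", "reasoning and knowledge", "general knowledge application") come from the paper, but the builder-side labels and the specific mapping are illustrative assumptions, not the authors' actual coding rules.

    # Sketch: map builder-specific competency wording onto shared categories.
    # The target categories appear in the paper; the source labels and this
    # particular mapping are illustrative assumptions only.
    SHARED_TAXONOMY = {
        "coding": {"code generation", "software engineering", "agentic coding"},
        "reasoning and knowledge": {"reasoning", "expert knowledge", "phd-level science"},
        "general knowledge application": {"general intelligence", "world knowledge"},
    }

    # Flatten to a direct lookup: builder label -> shared category.
    LABEL_TO_CATEGORY = {
        label: category
        for category, labels in SHARED_TAXONOMY.items()
        for label in labels
    }

    def normalize(builder_label: str) -> str:
        """Return the shared category for a builder's wording, or 'unmapped'."""
        return LABEL_TO_CATEGORY.get(builder_label.strip().lower(), "unmapped")

    print(normalize("PhD-level science"))  # -> reasoning and knowledge
    print(normalize("Creative writing"))   # -> unmapped (needs manual review)

In the paper this translation is grounded in what benchmark authors claim to measure; the dictionary above only illustrates the shape of the mapping, not its content.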

If this is right

  • Direct head-to-head comparisons of model capabilities become unreliable because few benchmarks are shared.
  • Progress claims toward general intelligence rest on loosely defined categories that mostly test math and science tasks.
  • Reproducible scientific tracking of model improvement is hindered when selection favors narrative fit over standardization.
  • Public understanding of state-of-the-art performance depends on whichever tests each company chooses to emphasize.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Regulators seeking standardized AI assessment may need to create independent benchmark suites outside company control.
  • The same selection logic could appear in other fast-moving technical fields where public perception drives investment.
  • Long-term research agendas may shift toward tasks that resist easy narrative framing if external pressure for comparability grows.

Load-bearing premise

The 139 releases and 231 highlighted benchmarks chosen by the 11 builders capture the main evaluation practices used across the AI industry in 2025.

What would settle it

A complete census of 2025 model releases showing that most builders share and interpret the same small set of benchmarks in the same way would contradict the reported fragmentation and narrative flexibility.

Figures

Figures reproduced from arXiv: 2605.14164 by Christo Buschek, Maty Bohacek, Stefan Baack.

Figure 1: Prescribed Competencies by Model Builders Within The Top 5 "Coding" Benchmarks. This graph shows the count of competency categories that model builders prescribe to benchmarks across model releases. Model builders inconsistently label the same benchmarks to represent different competencies across releases, even between model releases by the same organization.

Figure 2: Adoption of Benchmarks Released in 2025. The top five most adopted models are highlighted for clarity.

Figure 3: Highlight Frequency of Selected Competencies by Model Builders. This graph shows the trend of these selected competencies being highlighted in model releases throughout 2025.

Figure 4: Prescribed Competencies by Model Builders Within The Top Five "Reasoning and knowledge" Benchmarks. This heatmap shows the count of competency categories that model builders prescribe to benchmarks across model releases. MMMLU is excluded as it is a translation of MMLU’s test set.

Figure 5: Highlights of Competencies by Model Builders. This graph shows the trend of these selected competencies being highlighted in model releases.

Figure 6: Benchmarks View. Ordered by rank, each benchmark record presents its date of publication, assigned categories and models, affiliation distribution, and a paper link.

Figure 7: Benchmarks Visualization. Pictured above is a lollipop chart comparison of affiliation of benchmark creators by year, opened from the Benchmarks View.

Figure 8: Models View. Pictured above is the models view filtered by MMLU-Pro usage. Each model record presents its date of publication, publisher, access policy, affiliation sector and model parameters if available, domain, and the announcement link.

Figure 9: Models Visualization. Pictured above is a grouped bar chart of model access and publisher domain statistics filtered by model publisher sector (Industry), opened from the Models View.

Figure 10: Competencies View. The list contains all tested competencies within our custom taxonomy. Each taxonomy record presents the connected benchmarks, models, and prescribed categories, as well as the definition.

Figure 11: Competencies Visualization. Pictured above is a heatmap chart comparing the competencies that benchmarks are measuring vs. the competencies that model builders prescribe to them, opened from the Competencies View.
Original abstract

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the open Benchmarking-Cultures-25 dataset of 231 benchmarks highlighted across 139 model releases from 11 major AI builders in 2025. It reports quantitative patterns of fragmentation (63.2% single-builder benchmarks, 38.5% single-release usage) and inconsistent competency attributions, develops a unified taxonomy mapping author-stated signals, and qualitatively argues that highlighted benchmarks operate primarily as flexible narrative devices for market positioning rather than standardized scientific measurement tools, with many deemphasizing construct validity in favor of AGI-progress framing.

Significance. If the core observations hold, the work supplies a reproducible, open dataset and taxonomy that directly quantifies the lack of cross-model comparability in current AI evaluation practices. The direct counts from the assembled releases provide a solid descriptive foundation, and the interpretive framing offers a useful lens for metascience discussions of how benchmarks shape public and research perceptions of progress.

major comments (2)
  1. §4 (Qualitative analysis): The central claim that benchmarks function as narrative devices prioritizing market positioning rests on interpretive reading of author statements and framing; the manuscript does not report a systematic coding protocol, inter-rater reliability, or explicit decision rules for classifying 'AGI narrative' emphasis versus construct-validity focus, which weakens the load-bearing interpretive step from the quantitative fragmentation statistics.
  2. §2 (Data collection): The sample is restricted to 11 major builders and 139 releases; while the counts are internally consistent, the paper does not detail selection criteria or compare against a broader population of releases (e.g., smaller labs or open-source models), leaving the generalizability of the 'dominant evaluation practices' claim under-supported for the industry-wide conclusion.

minor comments (2)
  1. Table 1 or equivalent: The taxonomy categories (e.g., 'general knowledge application') would benefit from one or two concrete benchmark examples per category to illustrate how conflicting attributions were resolved.
  2. Abstract / §3: The interactive tool at bench-cultures.net is referenced, but the manuscript lacks a brief description of key exploratory features or example queries that readers can use to reproduce the reported statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below, indicating the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: §4 (Qualitative analysis): The central claim that benchmarks function as narrative devices prioritizing market positioning rests on interpretive reading of author statements and framing; the manuscript does not report a systematic coding protocol, inter-rater reliability, or explicit decision rules for classifying 'AGI narrative' emphasis versus construct-validity focus, which weakens the load-bearing interpretive step from the quantitative fragmentation statistics.

    Authors: We agree that greater methodological transparency would strengthen the qualitative section. The interpretations draw directly from quoted statements in the model releases, but we acknowledge the absence of a documented protocol. In the revised manuscript we will insert a new subsection in §4 that (a) lists the explicit decision rules used to flag AGI-progress framing versus construct-validity emphasis, (b) provides representative examples of each category and borderline cases, and (c) describes the author-led review process used to resolve disagreements. While we will not retroactively compute inter-rater reliability statistics, the added documentation will make the interpretive step reproducible and address the concern without changing the substantive claims. revision: yes

  2. Referee: §2 (Data collection): The sample is restricted to 11 major builders and 139 releases; while the counts are internally consistent, the paper does not detail selection criteria or compare against a broader population of releases (e.g., smaller labs or open-source models), leaving the generalizability of the 'dominant evaluation practices' claim under-supported for the industry-wide conclusion.

    Authors: The referee is correct that explicit selection criteria were not stated. We chose the 11 builders because they account for the releases that most directly shape public discourse and academic citations; however, this rationale should be documented. In revision we will (a) list the precise inclusion rule (builders that released at least one foundation or generative model in 2025 and issued public benchmark-highlighting materials), (b) name the 11 builders, and (c) add a limitations paragraph noting that practices among smaller labs or purely open-source projects may differ and that our conclusions apply specifically to the dominant industry actors. We will not expand the dataset to include those additional actors, as that would require a separate study. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims derive from direct empirical counts (63.2% single-builder benchmarks, 38.5% single-release usage) and a taxonomy explicitly grounded in benchmark authors' own stated claims about measured signals. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain that reduce findings to inputs by construction. The dataset Benchmarking-Cultures-25 is independently assembled from public releases, and the narrative-device interpretation follows from observable patterns without hidden reductions or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on two domain assumptions about data representativeness and the interpretive validity of company framing; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Press releases and company blog posts constitute the primary artifacts that now define the state of the art for foundation-model evaluation.
    The study treats these sources as the operative evaluation record rather than peer-reviewed papers.
  • domain assumption The 139 releases from 11 major builders are representative of current benchmarking culture.
    Scope is limited to these builders; generalizability to smaller or non-public labs is not tested.

pith-pipeline@v0.9.0 · 5617 in / 1297 out tokens · 35178 ms · 2026-05-15T04:51:28.083752+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 10 internal anchors

  1. [1]

    Mohamed Abdalla and Moustafa Abdalla. 2021. The Grey Hoodie Project: Big Tobacco, Big Tech, and the Threat on Academic Integrity. InProceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society(2021-07-21). 287–297. arXiv:2009.13676 [cs] doi:10.1145/3461702.3462563

  2. [2]

    Norah Alzahrani, Hisham Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, et al. 2024. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. InProceedings of the 62nd Annual Meeting of the Association for Computational Ling...

  3. [3]

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. arXiv:1606.06565 [cs] doi:10.48550/arXiv.1606.06565

  4. [4]

    Anthropic. 2025. Claude 3.7 Sonnet System Card

  5. [5]

    Andrew M Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, et al. 2025. Measuring what Matters: Construct Validity in Large Language Model Benchmarks. arXiv preprint arXiv:2511.04703(2025)

  6. [6]

    Borhane Blili-Hamelin, Christopher Graziul, Leif Hancox-Li, Hananel Hazan, El-Mahdi El-Mhamdi, Avijit Ghosh, Katherine A Heller, Jacob Metcalf, Fabricio Murai, Eryk Salvaggio, et al . [n. d.]. Position: Stop treating AGI as the north-star goal of AI research. In Forty-second International Conference on Machine Learning Position Paper Track

  7. [7]

    Matyas Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, and Stephanie CY Chan. 2025. Uncovering Competency Gaps in Large Language Models and Their Benchmarks.arXiv preprint arXiv:2512.20638(2025)

  8. [8]

    Rishi Bommasani. 2021. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258(2021)

  9. [9]

    Rishi Bommasani, Kevin Klyman, Sayash Kapoor, Shayne Longpre, Betty Xiong, Nestor Maslej, and Percy Liang. 2024. The 2024 Foundation Model Transparency Index.arXiv preprint arXiv:2407.12929(2024)

  10. [10]

    Rishi Bommasani, Percy Liang, and Tony Lee. 2023. Holistic evaluation of language models.Annals of the New York Academy of Sciences 1525, 1 (2023), 140–146

  11. [11]

    Samuel Bowman and George Dahl. 2021. What will it take to fix benchmarking in natural language understanding?. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4843–4855

  12. [12]

    Alexander Campolo. 2025. State-of-the-Art: The Temporal Order of Benchmarking Culture.Digital Society4, 2 (2025), 35

  13. [13]

    María Victoria Carro, Denise Alejandra Mester, Francisca Gauna Selasco, Luca Nicolás Forziati Gangi, Matheo Sandleris Musa, Lola Ramos Pereyra, Mario Leiva, Juan Gustavo Corvalan, María Vanina Martinez, and Gerardo Simari. 2025. A Conceptual Framework for AI Capability Evaluations.arXiv preprint arXiv:2506.18213(2025)

  14. [14]

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models.ACM transactions on intelligent systems and technology15, 3 (2024), 1–45

  15. [15]

    Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, et al . 2025. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 10091–10109

  16. [16]

    Yuxing Cheng, Yi Chang, and Yuan Wu. 2025. A survey on data contamination for large language models.arXiv preprint arXiv:2502.14425 (2025)

  17. [17]

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning

  18. [18]

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2024. Investigating data contamination in modern benchmarks for large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 8706–8719

  19. [19]

    Paul J. DiMaggio and Walter W. Powell. 1983. The Iron Cage Revisited: Institutional Isomorphism and Collective Rationality in Organizational Fields. 48, 2 (1983), 147–160. jstor:2095101 doi:10.2307/2095101

  20. [20]

    Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. 2024. Training on the test task confounds evaluation and emergence. arXiv preprint arXiv:2407.07890(2024)

  21. [21]

    Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca

  22. [22]

    Can we trust AI benchmarks? an interdisciplinary review of current issues in AI evaluation.arXiv preprint arXiv:2502.06559 (2025)

  23. [23]

    Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboards. arXiv preprint arXiv:2009.13888 (2020)

  24. [24]

    James Fodor. 2025. Line goes up? inherent limitations of benchmarks for evaluating large language models.arXiv preprint arXiv:2502.14318 (2025)

  25. [25]

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. 2025. Are we done with mmlu?. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tec...

  26. [26]

    Charles AE Goodhart. 1984. Problems of monetary management: the UK experience. InMonetary theory and practice: The UK experience. Springer, 91–121

  27. [27]

    Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, and Jason Schreiber. 2024. Benchmark inflation: Revealing llm performance gaps using retro-holdouts.arXiv preprint arXiv:2410.09247(2024)

  28. [28]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs] doi:10.48550/arXiv.2009.03300

  29. [29]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica

  30. [30]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. doi:10.48550/ARXIV.2403.07974

  31. [31]

    Ayrton San Joaquin, Rokas Gipiškis, Leon Staufer, and Ariel Gil. 2025. Deprecating Benchmarks: Criteria and Framework.arXiv preprint arXiv:2507.06434(2025)

  32. [32]

    Shaleen Khanal, Hongzhou Zhang, and Araz Taeihagh. 2025. Why and How Is the Power of Big Tech Increasing in the Policy Process? The Case of Generative AI. 44, 1 (2025), 52–69. doi:10.1093/polsoc/puae012

  33. [33]

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. 2021. Dynabench: Rethinking benchmarking in NLP. InProceedings of the 2021 conference of the North American chapter of the Association for Computational Linguistics: human language technologies. 4110–4124

  34. [34]

    Bernard Koch, Emily Denton, Alex Hanna, and Jacob G Foster. 2021. Reduced, reused and recycled: The life of a dataset in machine learning research.arXiv preprint arXiv:2112.01716(2021)

  35. [35]

    Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, et al. 2024. A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations.arXiv preprint arXiv:2407.04069(2024)

  36. [36]

    Yucheng Li, Frank Geurin, and Chenghua Lin. 2023. Avoiding data contamination in language model evaluation: Dynamic test construction with latest materials.arXiv preprint arXiv:2312.12343(2023)

  37. [37]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models.arXiv preprint arXiv:2211.09110(2022)

  38. [38]

    Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. 2021. Are we learning yet? a meta review of evaluation failures across machine learning. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  39. [39]

    João C. Magalhães and Rik Smit. 2026. Less Hype, More Drama: Open-Ended Technological Inevitability in Journalistic Discourses About AI in the US, The Netherlands, and Brazil. 14, 2 (2026), 323–340. doi:10.1080/21670811.2025.2522281

  40. [40]

    Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. 2025. Artificial intelligence index report 2025.arXiv preprint arXiv:2504.07139(2025)

  41. [41]

    Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. 2024. Levels of AGI for Operationalizing Progress on the Path to AGI. arXiv:2311.02462 [cs] doi:10.48550/arXiv.2311.02462

  42. [42]

    Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, and Min Yang. 2025. Training on the benchmark is not all you need. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 24948–24956

  43. [43]

    Rasmus Kleis Nielsen. 2024. How News Coverage, Often Uncritical, Helps Build up the AI Hype. http://reutersinstitute.politics.ox.ac.uk/news/how-news-coverage-often-uncritical-helps-build-ai-hype

  44. [44]

    OpenAI. 2023. GPT-4 Research Preview: Capabilities and Limitations

  45. [45]

    OpenAI. 2023. GPT-4 System Card. (2023)

  46. [46]

    OpenAI. 2024. OpenAI o1 System Card

  47. [47]

    Yonatan Oren, Nicole Meister, Niladri S Chatterji, Faisal Ladhak, and Tatsunori Hashimoto. 2023. Proving test set contamination in black-box language models. InThe Twelfth International Conference on Learning Representations

  48. [48]

    Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. 2022. Mapping global dynamics of benchmark creation and saturation in artificial intelligence.Nature Communications13, 1 (2022), 6793

  49. [49]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, ...

  50. [50]

    Inioluwa Deborah Raji, Emily M Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. 2021. AI and the everything in the whole wide world benchmark.arXiv preprint arXiv:2111.15366(2021)

  51. [51]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022 [cs] doi:10.48550/arXiv.2311.12022

  52. [52]

    Kevin Roose. 2025. When A.I. Passes This Test, Look Out. https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html

  53. [53]

    Eva Sánchez Salido, Julio Gonzalo, and Guillermo Marco. 2025. None of the others: a general technique to distinguish reasoning from memorization in multiple-choice llm evaluation benchmarks.arXiv preprint arXiv:2502.12896(2025)

  54. [54]

    David Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. 2018. Winner’s curse? On pace, progress, and empirical rigor. (2018). https://openreview.net/forum?id=rJWF0Fywf

  55. [55]

    Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Armin Haberl, Stefan Thalmann, and Dominik Kowald. 2025. Reproducibility in machine-learning-based research: Overview, barriers, and drivers.AI Magazine46, 2 (2025), e70002

  56. [56]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on machine learning research(2023)

  57. [57]

    Marilyn Strathern. 1997. ‘Improving ratings’: audit in the British University system.European review5, 3 (1997), 305–321

  58. [58]

    Savannah Thais. 2024. Misrepresented technological solutions in imagined futures: The origins and dangers of ai hype in the research community. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 7. 1455–1465

  59. [59]

    Alexander Wan, Kevin Klyman, Sayash Kapoor, Nestor Maslej, Shayne Longpre, Betty Xiong, Percy Liang, and Rishi Bommasani. 2025. The 2025 Foundation Model Transparency Index.arXiv preprint arXiv:2512.10169(2025)

  60. [60]

    Angelina Wang, Aaron Hertzmann, and Olga Russakovsky. 2024. Benchmark suites instead of leaderboards for evaluating AI fairness. Patterns5, 11 (2024)

  61. [61]

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv:2406.01574 [cs] doi:10.48550/arXiv.2406.01574

  62. [62]

    Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, and William Isaac. 2025. Toward an evaluation science for generative AI systems.arXiv preprint arXiv:2503.05336 (2025)

  63. [63]

    Andrew White. 2025. About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong. https://www.futurehouse.org/research-announcements/hle-exam

  64. [64]

    Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. 2024. On memorization of large language models in logical reasoning.arXiv preprint arXiv:2410.23123(2024)

  65. [65]

    Cheng Xu, Shuhao Guan, Derek Greene, M Kechadi, et al. 2024. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244(2024)

  66. [66]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark f...

  67. [67]

    Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, William Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. 2024. A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems37 (2024), 46819–46836

  68. [68]

    Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, et al. 2025. Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory.arXiv preprint arXiv:2505.15055 (2025)

  69. [69]

    Kyrie Zhixuan Zhou, Justin Eric Chen, Xiang Zheng, Yaoyao Qian, Yunpeng Xiao, and Kai Shu. 2025. "Everyone Else Does It": The Rise of Preprinting Culture in Computing Disciplines. arXiv preprint arXiv:2511.04081 (2025)