
arxiv: 2509.19590 · v2 · pith:FOGBG7G5new · submitted 2025-09-23 · 💻 cs.AI · cs.CY · cs.LG

Position: AI Evaluations Should be Grounded on a Theory of Capability

Pith reviewed 2026-05-21 21:53 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.LG
keywords AI evaluation, theory of capability, latent construct, benchmarking, inference tasks, generative models, evaluation practices

The pith

AI evaluations should be framed as inference tasks grounded on an explicit theory of capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI evaluations often present benchmark scores as direct measures of a model's capabilities, but these scores are actually inferences that rely on unexamined assumptions about what it means to be capable. The paper contends that evaluations need to be grounded in an explicit theory of capability, similar to how psychometrics models latent traits. The authors provide empirical evidence that different modeling assumptions can lead to very different conclusions about performance. They also introduce an Evaluation Card as a practical tool to make these assumptions transparent and open to scrutiny.

Core claim

The central claim is that AI evaluations are inferences rather than direct measurements, and that without an explicit theory of capability as a latent construct, the reliability of benchmark results cannot be properly assessed. By showing that performance reports depend strongly on the choice of modeling assumptions, the paper demonstrates the importance of making those assumptions explicit in AI contexts.
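A minimal sketch of how one set of responses can support two different performance reports. It is not the paper's experiment: the topic clusters, the outcomes, and both function names are hypothetical, chosen only to show that the reported number depends on what the evaluator assumes about how questions represent the task.

```python
from statistics import mean

# Hypothetical results: 1 = correct, 0 = incorrect, grouped by topic cluster.
# The cluster names and outcomes are invented for illustration only.
results = {
    "arithmetic": [1, 1, 1, 1, 1, 1, 1, 1],  # 8 questions, all correct
    "geometry": [1, 0],                      # 2 questions, 1 correct
    "probability": [0, 0],                   # 2 questions, none correct
}

def pooled_accuracy(by_cluster):
    """Assumption A: every question is an exchangeable draw from the task."""
    outcomes = [x for xs in by_cluster.values() for x in xs]
    return mean(outcomes)

def cluster_averaged_accuracy(by_cluster):
    """Assumption B: the task weights each topic cluster equally."""
    return mean(mean(xs) for xs in by_cluster.values())

pooled = pooled_accuracy(results)
balanced = cluster_averaged_accuracy(results)
print(f"pooled: {pooled:.3f}, cluster-averaged: {balanced:.3f}")
# prints "pooled: 0.750, cluster-averaged: 0.500"
```

Neither number is wrong; each answers a different question about what the benchmark is supposed to represent, which is the kind of choice the paper asks evaluators to state explicitly.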

What carries the argument

Framing AI evaluation as an inference task that requires an explicit theory of capability as a latent variable.

Load-bearing premise

The assumption that a model's underlying capability is a hidden trait that requires specific modeling choices to connect it to observed test scores, just as in psychological testing.
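One common way psychometrics links a hidden trait to observed responses is an item response model. The sketch below uses a two-parameter logistic (2PL) model as an illustrative stand-in; it is not the paper's LAAT algorithm, and the item parameters, response patterns, and the `ability_mle` helper are all hypothetical. It shows two response patterns with identical raw accuracy receiving different ability estimates once items are allowed to differ in discrimination.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ability_mle(responses, items, lo=-6.0, hi=6.0, iters=60):
    """Maximum-likelihood ability for fixed item parameters, via bisection.

    The score function sum_i a_i * (x_i - p_i(theta)) is strictly
    decreasing in theta when every a_i > 0, so its root is unique.
    Assumes the response pattern is neither all correct nor all wrong
    and that the root lies within [lo, hi].
    """
    def score(theta):
        return sum(a * (x - p_correct(theta, a, b))
                   for x, (a, b) in zip(responses, items))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical item parameters (discrimination a, difficulty b).
items = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

model_a = [1, 1, 0, 0]  # solves the two weakly discriminating items
model_b = [0, 0, 1, 1]  # solves the two strongly discriminating items

acc_a = sum(model_a) / len(model_a)
acc_b = sum(model_b) / len(model_b)
theta_a = ability_mle(model_a, items)
theta_b = ability_mle(model_b, items)
print(acc_a, acc_b, round(theta_a, 2), round(theta_b, 2))
```

Both hypothetical models score 50% on raw accuracy, yet the 2PL estimate ranks `model_b` above `model_a`. Under a Rasch model, where all discriminations are equal, the number-correct score is a sufficient statistic and the two would tie, so even the choice between latent-trait models is a modeling assumption.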

What would settle it

A study that applies multiple different theories of capability to the same set of AI models and finds that the inferred capabilities remain unchanged would falsify the claim that the theory matters critically.

Figures

Figures reproduced from arXiv: 2509.19590 by Ashia Wilson, Nathanael Jo.

Figure 1: Diagram of our proposed framework.
Figure 2: (a) Systematic bias between estimates of performance based on the original benchmark data (…
Figure 3: Same as Figure 2 but with …
Figure 4: (a) Estimates of accuracy using CBA (Alg. 1), and (b) estimates of ability using LAAT (Alg. 2), on …
Figure 5: Systematic bias between estimates of accuracy based on the original benchmark data and estimate of …
Figure 6: Mean absolute distance M, quantifying the expected deviation in performance for a new question/prompt from the benchmark distribution. Results are over all eight benchmark tasks, tested on seven LLMs.
Figure 7: Estimated accuracies with bootstrap confidence intervals, over the original benchmark [top] and over …
Figure 8: Estimates of accuracy using CBA (Algorithm 1) on eight benchmark tasks and over seven open-source …
Figure 9: Estimates of ability using LAAT (Algorithm 2) on eight benchmark tasks and over seven open-source …
Original abstract

Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript argues that evaluations of generative AI models are inferences rather than direct measurements because they presuppose an implicit theory of capability. It advocates reframing AI evaluations as explicit inference tasks grounded in a theory of capability, drawing from psychometrics practices. As a proof-of-concept, the paper empirically demonstrates that reported performance can depend strongly on the evaluator's modeling assumptions, and it concludes by proposing an Evaluation Card to document, justify, and scrutinize these modeling decisions.

Significance. If the central claim holds, this position paper could meaningfully advance AI evaluation practices by promoting transparency around modeling assumptions and reducing overinterpretation of benchmark scores. The proof-of-concept illustration of assumption sensitivity is a timely contribution that aligns with growing skepticism about benchmark reliability. Strengths include the clear logical framing and the practical Evaluation Card tool, though the empirical support remains illustrative rather than a full validation of improved inference accuracy.

major comments (1)
  1. [Proof-of-concept experiment] Proof-of-concept section: the demonstration that different modeling assumptions produce different performance estimates establishes sensitivity to assumptions but does not test whether a psychometrics-derived latent-trait model yields more valid or predictive inferences about underlying capability than simpler alternatives such as standard error bars, existing item-response adjustments in NLP benchmarks, or task-specific causal models. This comparison is load-bearing for the claim that explicit theory-grounded evaluation improves upon current practices.
minor comments (3)
  1. [Abstract] The abstract could more explicitly preview the structure of the Evaluation Card and its intended use cases.
  2. [Background on psychometrics] Some citations to foundational psychometrics references (e.g., on local independence or unidimensionality assumptions) would help readers from AI backgrounds follow the transfer argument.
  3. [Empirical results] Figures in the empirical section should show error bars or confidence intervals, and their captions should say how these were computed, to clarify the magnitude of assumption-driven variation.
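As a rough illustration of the kind of interval the last minor comment asks for, the sketch below computes a percentile bootstrap confidence interval for accuracy. The outcomes, the `bootstrap_ci` helper, and its defaults are hypothetical, and this is not the paper's CBA procedure. Note that resampling questions independently is itself a modeling assumption: it treats questions as exchangeable draws and ignores any clustering.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy over questions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample questions with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-question outcomes (1 = correct) for one model.
outcomes = [1] * 70 + [0] * 30
accuracy = sum(outcomes) / len(outcomes)
lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={accuracy:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

If questions come in related groups, resampling whole groups instead of single questions would typically widen the interval, which is one concrete way the choice of assumptions changes what gets reported.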

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our position paper. We respond to the major comment below, clarifying the scope of our contribution while acknowledging the limits of the current empirical illustration.

Point-by-point responses
  1. Referee: Proof-of-concept section: the demonstration that different modeling assumptions produce different performance estimates establishes sensitivity to assumptions but does not test whether a psychometrics-derived latent-trait model yields more valid or predictive inferences about underlying capability than simpler alternatives such as standard error bars, existing item-response adjustments in NLP benchmarks, or task-specific causal models. This comparison is load-bearing for the claim that explicit theory-grounded evaluation improves upon current practices.

    Authors: We agree that the proof-of-concept establishes sensitivity to modeling assumptions without directly comparing the validity or predictive power of a latent-trait model against alternatives such as standard error bars or existing item-response theory adjustments. As a position paper, our central claim is that AI evaluations are inferences that already presuppose some (often implicit) theory of capability, and that making this theory explicit enables better scrutiny and transparency. The empirical demonstration is intended to illustrate the practical consequences of differing assumptions rather than to validate any specific modeling framework as superior. We do not claim in the manuscript that a psychometrics-derived approach is empirically better than the listed alternatives; instead, we argue that current practices would benefit from explicit documentation of whatever theory is being used. We will revise the manuscript to more explicitly state the illustrative purpose of the experiment, to avoid any implication of validated superiority, and to identify comparative validation studies as an important direction for future work. This revision addresses the referee's concern by sharpening the framing without expanding the paper's scope beyond a position piece.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: conceptual argument and illustrative demo are independent of inputs

Full rationale

The paper advances a position that benchmark scores are inferences presupposing an implicit theory of capability, advocating explicit modeling drawn from psychometrics. This rests on logical analysis of evaluation practices rather than any derivation, equation, or fit. The proof-of-concept empirically illustrates that performance numbers vary with modeling assumptions; this is a direct consequence of changing the assumptions and does not reduce to a fitted parameter renamed as prediction or any self-referential construction. No self-citations are load-bearing, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled in. The Evaluation Card is a documentation template, not a derived result. The chain is self-contained against external benchmarks from psychometrics and standard evaluation critique.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that capability is a latent variable best interpreted through explicit modeling choices, drawn from psychometrics without new free parameters or invented entities.

axioms (1)
  • domain assumption: Benchmark scores are inferences that already presuppose a theory of capability
    Explicitly stated in the abstract as the starting point for the argument.


