Position: AI Evaluations Should be Grounded on a Theory of Capability

Ashia Wilson; Nathanael Jo

arxiv: 2509.19590 · v2 · pith:FOGBG7G5new · submitted 2025-09-23 · 💻 cs.AI · cs.CY · cs.LG

Pith reviewed 2026-05-21 21:53 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.LG

keywords AI evaluation · theory of capability · latent construct · benchmarking · inference tasks · generative models · evaluation practices

The pith

AI evaluations should be framed as inference tasks grounded on an explicit theory of capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI evaluations often present benchmark scores as direct measures of a model's capabilities, but these scores are actually inferences that rely on unexamined assumptions about what it means to be capable. The paper contends that evaluations need to be grounded in an explicit theory of capability, similar to how psychometrics models latent traits. The authors provide empirical evidence that different modeling assumptions can lead to very different conclusions about performance. They also introduce an Evaluation Card as a practical tool to make these assumptions transparent and open to scrutiny.

Core claim

The central claim is that AI evaluations are inferences rather than direct measurements, and that without an explicit theory of capability as a latent construct, the reliability of benchmark results cannot be properly assessed. By showing that performance reports depend strongly on the choice of modeling assumptions, the paper demonstrates the importance of making those assumptions explicit in AI contexts.
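
To make this sensitivity concrete, here is a minimal sketch on synthetic data. It is not the paper's experiment or its CBA algorithm. It assumes a benchmark in which each question is asked in several prompt variants, and it compares a bootstrap that treats every response as independent with one that resamples whole questions. All function names, sizes, and probabilities are hypothetical choices made for illustration.

```python
import random


def make_scores(n_questions=40, n_variants=5, seed=0):
    """Synthetic 0/1 scores (assumed setup, not real benchmark data).

    Each question gets its own success probability, so the variants of a
    question are correlated with one another.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_questions):
        p = rng.choice([0.1, 0.9])  # a question is mostly hard or mostly easy
        scores.append([1 if rng.random() < p else 0 for _ in range(n_variants)])
    return scores


def mean(xs):
    return sum(xs) / len(xs)


def percentile_interval(samples, alpha=0.05):
    """Simple percentile interval from a list of bootstrap statistics."""
    s = sorted(samples)
    lo = s[int((alpha / 2) * len(s))]
    hi = s[int((1 - alpha / 2) * len(s)) - 1]
    return lo, hi


def iid_bootstrap(scores, n_boot=2000, seed=1):
    """Assumption A: every single response is an independent draw."""
    rng = random.Random(seed)
    flat = [y for question in scores for y in question]
    stats = [mean(rng.choices(flat, k=len(flat))) for _ in range(n_boot)]
    return percentile_interval(stats)


def clustered_bootstrap(scores, n_boot=2000, seed=1):
    """Assumption B: questions are the sampling unit; variants travel together."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resampled = rng.choices(scores, k=len(scores))
        stats.append(mean([y for question in resampled for y in question]))
    return percentile_interval(stats)


scores = make_scores()
point = mean([y for question in scores for y in question])
iid_lo, iid_hi = iid_bootstrap(scores)
cl_lo, cl_hi = clustered_bootstrap(scores)

print(f"accuracy          : {point:.3f}")
print(f"i.i.d. interval   : [{iid_lo:.3f}, {iid_hi:.3f}]")
print(f"clustered interval: [{cl_lo:.3f}, {cl_hi:.3f}]")
```

Under these assumptions the clustered interval comes out wider than the i.i.d. one, because responses to variants of the same question are correlated. The point estimate is identical in both cases; only the assumed sampling unit, and therefore the reported uncertainty, changes.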

What carries the argument

Framing AI evaluation as an inference task that requires an explicit theory of capability as a latent variable.

Load-bearing premise

The assumption that a model's underlying capability is a hidden trait that requires specific modeling choices to connect it to observed test scores, just as in psychological testing.
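
For readers unfamiliar with latent-trait modeling, one standard textbook instance is the one-parameter logistic (Rasch) model from item response theory. It is shown here only as an illustration of how a hidden trait can be linked to observed scores, and it need not match the model used in the paper. The symbols are assumptions introduced for this sketch: $\theta_i$ is the latent ability of model $i$, $b_j$ is the difficulty of item $j$, and $y_{ij} \in \{0, 1\}$ records whether model $i$ answered item $j$ correctly.

```latex
P(y_{ij} = 1 \mid \theta_i, b_j) = \frac{1}{1 + \exp\bigl(-(\theta_i - b_j)\bigr)}
```

Under this model a raw accuracy is not the ability itself: the same $\theta_i$ yields a high score on easy items and a low score on hard ones, so any ability estimate depends on the assumed link and on the item parameters.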

What would settle it

A study that applies multiple different theories of capability to the same set of AI models and finds that the inferred capabilities remain unchanged would falsify the claim that the theory matters critically.
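
One way such a comparison could be operationalized is sketched below. This is a hypothetical check, not a procedure from the paper. Because different theories may report capability on different scales, the sketch compares only the ordering of models, using the fraction of model pairs ranked the same way by two sets of estimates. The model names and numbers are placeholders, not results.

```python
from itertools import combinations


def pairwise_agreement(est_a, est_b):
    """Fraction of model pairs that both estimates order the same way.

    est_a and est_b map model name -> capability estimate under two
    different (assumed) theories; ties count as disagreement.
    """
    models = sorted(est_a)
    pairs = list(combinations(models, 2))
    same = 0
    for m, n in pairs:
        diff_a = est_a[m] - est_a[n]
        diff_b = est_b[m] - est_b[n]
        if diff_a * diff_b > 0:  # same sign means same ordering
            same += 1
    return same / len(pairs)


# Placeholder values for illustration only (not taken from any study).
raw_accuracy = {"model_a": 0.71, "model_b": 0.68, "model_c": 0.55}
latent_ability = {"model_a": 0.40, "model_b": 0.90, "model_c": -0.30}

agreement = pairwise_agreement(raw_accuracy, latent_ability)
print(f"pairwise ranking agreement: {agreement:.2f}")
```

An agreement of 1.0 across many theories and models would point toward the theory not mattering for rankings; values well below 1.0, as in this made-up example where one pair flips, would point the other way.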

Figures

Figures reproduced from arXiv: 2509.19590 by Ashia Wilson, Nathanael Jo.

**Figure 1.** Diagram of our proposed framework.

**Figure 2.** (a) Systematic bias between estimates of performance based on the original benchmark data (…

**Figure 3.** Same as Figure 2 but with …

**Figure 4.** (a) Estimates of accuracy using CBA (Alg. 1), and (b) estimates of ability using LAAT (Alg. 2), on …

**Figure 5.** Systematic bias between estimates of accuracy based on the original benchmark data and estimate of …

**Figure 6.** Mean absolute distance M, quantifying the expected deviation in performance for a new question/prompt from the benchmark distribution. Results are over all eight benchmark tasks, tested on seven LLMs.

**Figure 7.** Estimated accuracies with bootstrap confidence intervals, over the original benchmark [top] and over …

**Figure 8.** Estimates of accuracy using CBA (Algorithm 1) on eight benchmark tasks and over seven open-source …

**Figure 9.** Estimates of ability using LAAT (Algorithm 2) on eight benchmark tasks and over seven open-source …

Original abstract

Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper correctly identifies that benchmark scores depend on implicit assumptions, but it does not sufficiently demonstrate that psychometrics is the best way to address the problem.

The core point is that AI evaluations are inferences resting on unstated theories of capability, and the authors argue that those theories should be made explicit using tools from psychometrics. They support this with a proof-of-concept showing that performance estimates vary with the modeling assumptions, and they propose an Evaluation Card to help document those choices. The framing is not revolutionary, but applying it directly to generative models and providing a practical template is a solid step. It highlights a real issue in how benchmarks are reported and interpreted, and the card could encourage better habits in the community.

The weaker part is the leap to psychometrics without tackling potential mismatches. The stress test note is on point: the fact that different assumptions change the numbers does not automatically mean a latent trait model is superior to standard statistical corrections or task-specific models. Generative models can behave in ways that break unidimensionality or local independence, and the paper does not explore whether the psychometric approach actually improves validity or prediction over those alternatives.

This paper is for researchers focused on evaluation methodology who are open to cross-field ideas. It won't change the field overnight, but it offers a structured way to think about the problem. I recommend sending it for peer review: the argument is coherent, the illustration is useful, and the output is actionable. Referees can help strengthen the case for transferability.

Referee Report

1 major / 3 minor

Summary. The manuscript argues that evaluations of generative AI models are inferences rather than direct measurements because they presuppose an implicit theory of capability. It advocates reframing AI evaluations as explicit inference tasks grounded in a theory of capability, drawing from psychometrics practices. As a proof-of-concept, the paper empirically demonstrates that reported performance can depend strongly on the evaluator's modeling assumptions, and it concludes by proposing an Evaluation Card to document, justify, and scrutinize these modeling decisions.

Significance. If the central claim holds, this position paper could meaningfully advance AI evaluation practices by promoting transparency around modeling assumptions and reducing overinterpretation of benchmark scores. The proof-of-concept illustration of assumption sensitivity is a timely contribution that aligns with growing skepticism about benchmark reliability. Strengths include the clear logical framing and the practical Evaluation Card tool, though the empirical support remains illustrative rather than a full validation of improved inference accuracy.

major comments (1)

[Proof-of-concept experiment] Proof-of-concept section: the demonstration that different modeling assumptions produce different performance estimates establishes sensitivity to assumptions but does not test whether a psychometrics-derived latent-trait model yields more valid or predictive inferences about underlying capability than simpler alternatives such as standard error bars, existing item-response adjustments in NLP benchmarks, or task-specific causal models. This comparison is load-bearing for the claim that explicit theory-grounded evaluation improves upon current practices.

minor comments (3)

[Abstract] The abstract could more explicitly preview the structure of the Evaluation Card and its intended use cases.
[Background on psychometrics] Some citations to foundational psychometrics references (e.g., on local independence or unidimensionality assumptions) would help readers from AI backgrounds follow the transfer argument.
[Empirical results] Figures in the empirical section should show error bars or confidence intervals, with captions stating which, to clarify the magnitude of assumption-driven variation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our position paper. We respond to the major comment below, clarifying the scope of our contribution while acknowledging the limits of the current empirical illustration.

Referee: Proof-of-concept section: the demonstration that different modeling assumptions produce different performance estimates establishes sensitivity to assumptions but does not test whether a psychometrics-derived latent-trait model yields more valid or predictive inferences about underlying capability than simpler alternatives such as standard error bars, existing item-response adjustments in NLP benchmarks, or task-specific causal models. This comparison is load-bearing for the claim that explicit theory-grounded evaluation improves upon current practices.

Authors: We agree that the proof-of-concept establishes sensitivity to modeling assumptions without directly comparing the validity or predictive power of a latent-trait model against alternatives such as standard error bars or existing item-response theory adjustments. As a position paper, our central claim is that AI evaluations are inferences that already presuppose some (often implicit) theory of capability, and that making this theory explicit enables better scrutiny and transparency. The empirical demonstration is intended to illustrate the practical consequences of differing assumptions rather than to validate any specific modeling framework as superior. We do not claim in the manuscript that a psychometrics-derived approach is empirically better than the listed alternatives; instead, we argue that current practices would benefit from explicit documentation of whatever theory is being used. We will revise the manuscript to more explicitly state the illustrative purpose of the experiment, to avoid any implication of validated superiority, and to identify comparative validation studies as an important direction for future work. This revision addresses the referee's concern by sharpening the framing without expanding the paper's scope beyond a position piece.

revision: partial

Circularity Check

0 steps flagged

No circularity: conceptual argument and illustrative demo are independent of inputs

The paper advances a position that benchmark scores are inferences presupposing an implicit theory of capability, advocating explicit modeling drawn from psychometrics. This rests on logical analysis of evaluation practices rather than any derivation, equation, or fit. The proof-of-concept empirically illustrates that performance numbers vary with modeling assumptions; this is a direct consequence of changing the assumptions and does not reduce to a fitted parameter renamed as prediction or any self-referential construction. No self-citations are load-bearing, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled in. The Evaluation Card is a documentation template, not a derived result. The chain is self-contained against external benchmarks from psychometrics and standard evaluation critique.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that capability is a latent variable best interpreted through explicit modeling choices, drawn from psychometrics without new free parameters or invented entities.

axioms (1)

domain assumption: Benchmark scores are inferences that already presuppose a theory of capability.
Explicitly stated in the abstract as the starting point for the argument.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability... start from a theory of performance, and develop methods for inference from that theory."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "$\phi_i = \theta_i + s(x_i) + \epsilon_i$ ... Assumption 2... clustered bootstrapping... adaptive test based on item response theory"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 6 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

When benchmarks are targets: Revealing the sensitivity of large language model leaderboards

Norah Alzahrani, Hisham Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, et al. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguisti...

work page 2024
[4]

fl-irt-ing with psychometrics to improve nlp bias measurement

Dominik Bachmann, Oskar van der Wal, Edita Chvojka, Willem H Zuidema, Leendert van Maanen, and Katrin Schulz. fl-irt-ing with psychometrics to improve nlp bias measurement. Minds and Machines, 34 0 (4): 0 37, 2024

work page 2024
[5]

The basics of item response theory

Frank B Baker. The basics of item response theory. ERIC, 2001

work page 2001
[6]

Item response theory: Parameter estimation techniques

Frank B Baker and Seock-Ho Kim. Item response theory: Parameter estimation techniques. CRC press, 2004

work page 2004
[7]

Some asymptotic theory for the bootstrap

Peter J Bickel and David A Freedman. Some asymptotic theory for the bootstrap. The annals of statistics, 9 0 (6): 0 1196--1217, 1981

work page 1981
[8]

Generalizability theory

Robert L Brennan. Generalizability theory. In The history of educational measurement, pp.\ 206--231. Routledge, 2021

work page 2021
[9]

Statistical inference

George Casella and Roger Berger. Statistical inference. CRC press, 2024

work page 2024
[10]

Adversarial robustness for machine learning

Pin-Yu Chen and Cho-Jui Hsieh. Adversarial robustness for machine learning. Academic Press, 2022

work page 2022
[11]

Chatbot arena: An open platform for evaluating llms by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[12]

On the Measure of Intelligence

Fran c ois Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[13]

How scores are calculated, 2025

College Board . How scores are calculated, 2025. URL https://satsuite.collegeboard.org/scores/what-scores-mean/how-scores-calculated. Accessed: 2025-05-14

work page 2025
[14]

General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. arXiv preprint arXiv:2402.15938, 2024

work page arXiv 2024
[15]

Robustness challenges of large language models in natural language understanding: A survey, 2022

Mengnan Du, Fengxiang He, Na Zou, Dacheng Tao, and Xia Hu. Robustness challenges of large language models in natural language understanding: A survey, 2022

work page 2022
[16]

Lmentry: A language model benchmark of elementary language tasks

Avia Efrat, Or Honovich, and Omer Levy. Lmentry: A language model benchmark of elementary language tasks. In Findings of the Association for Computational Linguistics: ACL 2023, pp.\ 10476--10501, 2023

work page 2023
[17]

Can we trust ai benchmarks? an interdisciplinary review of current issues in ai evaluation

Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. Can we trust ai benchmarks? an interdisciplinary review of current issues in ai evaluation. arXiv preprint arXiv:2502.06559, 2025

work page arXiv 2025
[18]

What did i do wrong? quantifying llms' sensitivity and consistency to prompt engineering

Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco. What did i do wrong? quantifying llms' sensitivity and consistency to prompt engineering. arXiv preprint arXiv:2406.12334, 2024

work page arXiv 2024
[19]

Bootstrapping clustered data

Christopher A Field and Alan H Welsh. Bootstrapping clustered data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69 0 (3): 0 369--390, 2007

work page 2007
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

The emerging science of machine learning benchmarks

Moritz Hardt. The emerging science of machine learning benchmarks. Online at https://mlbenchmarks.org, 2025. Manuscript

work page 2025
[22]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021
[23]

General intelligence disentangled via a generality metric for natural and artificial intelligence

Jos \'e Hern \'a ndez-Orallo, Bao Sheng Loe, Lucy Cheke, Fernando Mart \' nez-Plumed, and Se \'a n \'O h \'E igeartaigh. General intelligence disentangled via a generality metric for natural and artificial intelligence. Scientific reports, 11 0 (1): 0 22822, 2021

work page 2021
[24]

Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? Intelligence, 106: 0 101858, 2024

David Ili \'c and Gilles E Gignac. Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? Intelligence, 106: 0 101858, 2024

work page 2024
[25]

Robust prompt optimization for large language models against distribution shifts

Moxin Li, Wenjie Wang, Fuli Feng, Yixin Cao, Jizhi Zhang, and Tat-Seng Chua. Robust prompt optimization for large language models against distribution shifts. arXiv preprint arXiv:2305.13954, 2023

work page arXiv 2023
[26]

Holistic evaluation of language models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=iO4LZibEqW. Featured Certification, Expert Certification, Outs...

work page 2023
[27]

T ruthful QA : Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. T ruthful QA : Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3214--3252, Dublin, Ireland, May 2022. Association for Computatio...

work page doi:10.18653/v1/2022.acl-long.229 2022
[28]

Statistical theories of mental test scores

Frederic M Lord and Melvin R Novick. Statistical theories of mental test scores. IAP, 2008

work page 2008
[29]

tiny B enchmarks: evaluating LLM s with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tiny B enchmarks: evaluating LLM s with fewer examples. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volu...

work page 2024
[30]

Adding error bars to evals: A statistical approach to language model evaluations

Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640, 2024

work page arXiv 2024
[31]

How do we know how smart ai systems are?, 2023

Melanie Mitchell. How do we know how smart ai systems are?, 2023

work page 2023
[32]

Debates on the nature of artificial general intelligence, 2024

Melanie Mitchell. Debates on the nature of artificial general intelligence, 2024

work page 2024
[33]

State of what art? a call for multi-prompt llm evaluation

Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? a call for multi-prompt llm evaluation. Transactions of the Association for Computational Linguistics, 12: 0 933--949, 2024

work page 2024
[34]

Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena

Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545, 2024

work page arXiv 2024
[35]

On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection

Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in statistics: Methodology and distribution, pp.\ 123--150. Springer, 1992

work page 1992
[36]

Numerical optimization

Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999

work page 1999
[37]

Evaluation metrics and statistical tests for machine learning

Oona Rainio, Jarmo Teuho, and Riku Kl \'e n. Evaluation metrics and statistical tests for machine learning. Scientific Reports, 14 0 (1): 0 6086, 2024

work page 2024
[38]

Bender, Alex Hanna, and Amandalynne Paullada

Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, and Amandalynne Paullada. Ai and the everything in the whole wide world benchmark. In J. Vanschoren and S. Yeung (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper...

work page 2021
[39]

Introduction to psychometric theory

Tenko Raykov and George A Marcoulides. Introduction to psychometric theory. Routledge, 2011

work page 2011
[40]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA : A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

work page 2024
[41]

Nonparametric bootstrapping for hierarchical data

Shiquan Ren, Hong Lai, Wenjing Tong, Mostafa Aminzadeh, Xuezhang Hou, and Shenghan Lai. Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics, 37 0 (9): 0 1487--1498, 2010

work page 2010
[42]

Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P Lalor, Robin Jia, and Jordan Boyd-Graber. Evaluation examples are not equally informative: How should that change nlp leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processi...

work page 2021
[43]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Evaluation data contamination in llms: how do we measure it and (when) does it matter? arXiv preprint arXiv:2411.03923, 2024

Aaditya K Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, and Dieuwke Hupkes. Evaluation data contamination in llms: how do we measure it and (when) does it matter? arXiv preprint arXiv:2411.03923, 2024

work page arXiv 2024
[45]

Examining the robustness of llm evaluation to the distributional assumptions of benchmarks

Charlotte Siska, Katerina Marazopoulou, Melissa Ailem, and James Bono. Examining the robustness of llm evaluation to the distributional assumptions of benchmarks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 10406--10421, 2024

work page 2024
[46]

Brown, Adam Santoro, Aditya Gupta, Adri \`a Garriga-Alonso, et al

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adri \`a Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/...

work page 2023
[47]

Challenging big-bench tasks and whether chain-of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp.\ 13003--13051, 2023

work page 2023
[48]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

An intellectual history of parametric item response theory models in the twentieth century

David Thissen and Lynne Steinberg. An intellectual history of parametric item response theory models in the twentieth century. Chinese/English Journal of Educational Measurement and Evaluation, 1 0 (1): 0 5, 2020

work page 2020
[50]

Comparing test sets with item response theory

Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, and Samuel R Bowman. Comparing test sets with item response theory. arXiv preprint arXiv:2106.00840, 2021

work page arXiv 2021
[51]

Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461, 2025

Joshua Vendrow, Edward Vendrow, Sara Beery, and Aleksander Madry. Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461, 2025

work page arXiv 2025
[52]

Evaluating general-purpose ai with psychometrics

Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, and Xing Xie. Evaluating general-purpose ai with psychometrics. arXiv preprint arXiv:2310.16379, 2023

work page arXiv 2023
[53]

Cognitive diagnostic models and how they can be useful

Joanna Williamson. Cognitive diagnostic models and how they can be useful. research report. Cambridge University Press & Assessment, 2023

work page 2023
[54]

Improving the robustness of large language models via consistency alignment

Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, and Dawei Yin. Improving the robustness of large language models via consistency alignment. arXiv preprint arXiv:2403.14221, 2024

work page arXiv 2024
[55]

Large language models are not robust multiple choice selectors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2023

work page arXiv 2023
[56] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
[57] Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Zachary A Pardos, Patrick C Kyllonen, Jiyun Zu, Qingyang Mao, Rui Lv, Zhenya Huang, et al. From static benchmarks to adaptive testing: Psychometrics in AI evaluation. arXiv preprint arXiv:2306.10512, 2023.
[58] Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. ProSA: Assessing and understanding the prompt sensitivity of LLMs. arXiv preprint arXiv:2410.12405, 2024.
