Position: AI Evaluations Should be Grounded on a Theory of Capability
Pith reviewed 2026-05-21 21:53 UTC · model grok-4.3
The pith
AI evaluations should be framed as inference tasks grounded on an explicit theory of capability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that AI evaluations are inferences rather than direct measurements, and that without an explicit theory of capability as a latent construct, the reliability of benchmark results cannot be properly assessed. By showing that performance reports depend strongly on the choice of modeling assumptions, the paper demonstrates the importance of making those assumptions explicit in AI contexts.
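To make the dependence on modeling assumptions concrete, here is a minimal sketch, not the paper's experiment. It computes the standard error of one accuracy figure twice: once treating items as independent, and once treating them as clusters of related items that are resampled together. The toy data, the cluster structure, and the function names are all assumptions made for illustration.

```python
import random
import statistics


def iid_bootstrap_se(scores, n_boot=2000, seed=0):
    """Bootstrap standard error of mean accuracy, resampling individual items."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_boot):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)


def clustered_bootstrap_se(clusters, n_boot=2000, seed=0):
    """Bootstrap standard error of mean accuracy, resampling whole clusters."""
    rng = random.Random(seed)
    k = len(clusters)
    means = []
    for _ in range(n_boot):
        drawn = [clusters[rng.randrange(k)] for _ in range(k)]
        flat = [s for cluster in drawn for s in cluster]
        means.append(sum(flat) / len(flat))
    return statistics.stdev(means)


# Hypothetical toy data: 20 clusters of 10 related items each.
# Half the clusters are mostly solved (9/10), half mostly failed (3/10).
clusters = [[1] * 9 + [0] * 1 for _ in range(10)]
clusters += [[1] * 3 + [0] * 7 for _ in range(10)]

scores = [s for cluster in clusters for s in cluster]
accuracy = sum(scores) / len(scores)
se_iid = iid_bootstrap_se(scores)
se_clustered = clustered_bootstrap_se(clusters)

print(f"accuracy             = {accuracy:.3f}")
print(f"SE (items i.i.d.)    = {se_iid:.3f}")
print(f"SE (items clustered) = {se_clustered:.3f}")
```

The point estimate is identical under both assumptions. Only the stated uncertainty changes, and it changes because of a modeling decision about how items relate to one another, not because of anything in the model's responses.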
What carries the argument
Framing AI evaluation as an inference task that requires an explicit theory of capability as a latent variable.
Load-bearing premise
The assumption that a model's underlying capability is a hidden trait that requires specific modeling choices to connect it to observed test scores, just as in psychological testing.
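One way to see what this premise asks of the evaluator is a one-parameter (Rasch) item response model, used here only as an example of a candidate theory of capability and not as the paper's chosen model. In this sketch the item difficulties are assumed to be known, and the ability value is the maximum-likelihood estimate given the observed responses. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rasch_ability(responses, difficulties, lo=-10.0, hi=10.0, tol=1e-9):
    """Maximum-likelihood ability under a Rasch model with known difficulties.

    Solves sum(responses) == sum(sigmoid(theta - b_i)) for theta by bisection.
    The estimate is finite only when the raw score is strictly between
    0 and the number of items.
    """
    raw = sum(responses)
    if not 0 < raw < len(responses):
        raise ValueError("MLE is unbounded for all-correct or all-wrong responses")

    def expected_minus_raw(theta):
        return sum(sigmoid(theta - b) for b in difficulties) - raw

    # expected_minus_raw is increasing in theta, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_minus_raw(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Same raw score (7 of 10) on two hypothetical item sets.
responses = [1] * 7 + [0] * 3
easy_items = [-1.0] * 10  # assumed difficulties
hard_items = [1.0] * 10   # assumed difficulties

theta_easy = rasch_ability(responses, easy_items)
theta_hard = rasch_ability(responses, hard_items)

print(f"raw accuracy in both cases: {sum(responses) / len(responses):.2f}")
print(f"inferred ability on easy items: {theta_easy:.3f}")
print(f"inferred ability on hard items: {theta_hard:.3f}")
```

Under this theory, the same 70% accuracy maps to different latent abilities depending on the assumed item difficulties. A different theory would link scores to capability in a different way, which is why the premise calls for the link to be stated explicitly.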
What would settle it
A study that applies multiple different theories of capability to the same set of AI models and finds that the inferred capabilities remain unchanged would falsify the claim that the theory matters critically.
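The check described above can be sketched in a few lines. In this hedged example, two simple aggregation rules stand in for competing theories of capability: micro-averaging treats every item as equally informative, while macro-averaging treats every subtask as equally informative. The model names, subtask names, and counts are invented for illustration and are not results from the paper.

```python
def micro_accuracy(results):
    """Pool all items: each item counts equally."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total


def macro_accuracy(results):
    """Average per-subtask accuracies: each subtask counts equally."""
    return sum(c / n for c, n in results.values()) / len(results)


def ranking(models, score_fn):
    """Model names ordered from highest to lowest score."""
    return sorted(models, key=lambda name: score_fn(models[name]), reverse=True)


# Hypothetical results: subtask -> (number correct, number of items).
models = {
    "model_a": {"subtask_1": (80, 90), "subtask_2": (2, 10)},
    "model_b": {"subtask_1": (70, 90), "subtask_2": (9, 10)},
}

theories = {"micro": micro_accuracy, "macro": macro_accuracy}
rankings = {name: ranking(models, fn) for name, fn in theories.items()}
rankings_agree = len({tuple(r) for r in rankings.values()}) == 1

for name, order in rankings.items():
    print(f"{name}: {order}")
print(f"rankings agree across theories: {rankings_agree}")
```

If the rankings agreed across a broad family of plausible theories and real benchmarks, that would count against the claim. In this toy case they disagree, which is the situation the paper argues is common enough to warrant explicit documentation.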
Original abstract
Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that evaluations of generative AI models are inferences rather than direct measurements because they presuppose an implicit theory of capability. It advocates reframing AI evaluations as explicit inference tasks grounded in a theory of capability, drawing from psychometrics practices. As a proof-of-concept, the paper empirically demonstrates that reported performance can depend strongly on the evaluator's modeling assumptions, and it concludes by proposing an Evaluation Card to document, justify, and scrutinize these modeling decisions.
Significance. If the central claim holds, this position paper could meaningfully advance AI evaluation practices by promoting transparency around modeling assumptions and reducing overinterpretation of benchmark scores. The proof-of-concept illustration of assumption sensitivity is a timely contribution that aligns with growing skepticism about benchmark reliability. Strengths include the clear logical framing and the practical Evaluation Card tool, though the empirical support remains illustrative rather than a full validation of improved inference accuracy.
major comments (1)
- [Proof-of-concept experiment] Proof-of-concept section: the demonstration that different modeling assumptions produce different performance estimates establishes sensitivity to assumptions but does not test whether a psychometrics-derived latent-trait model yields more valid or predictive inferences about underlying capability than simpler alternatives such as standard error bars, existing item-response adjustments in NLP benchmarks, or task-specific causal models. This comparison is load-bearing for the claim that explicit theory-grounded evaluation improves upon current practices.
minor comments (3)
- [Abstract] The abstract could more explicitly preview the structure of the Evaluation Card and its intended use cases.
- [Background on psychometrics] Some citations to foundational psychometrics references (e.g., on local independence or unidimensionality assumptions) would help readers from AI backgrounds follow the transfer argument.
- [Empirical results] Figures in the empirical section should show error bars or confidence intervals, and their captions should state what those intervals represent, to clarify the magnitude of assumption-driven variation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our position paper. We respond to the major comment below, clarifying the scope of our contribution while acknowledging the limits of the current empirical illustration.
Referee: Proof-of-concept section: the demonstration that different modeling assumptions produce different performance estimates establishes sensitivity to assumptions but does not test whether a psychometrics-derived latent-trait model yields more valid or predictive inferences about underlying capability than simpler alternatives such as standard error bars, existing item-response adjustments in NLP benchmarks, or task-specific causal models. This comparison is load-bearing for the claim that explicit theory-grounded evaluation improves upon current practices.
Authors: We agree that the proof-of-concept establishes sensitivity to modeling assumptions without directly comparing the validity or predictive power of a latent-trait model against alternatives such as standard error bars or existing item-response theory adjustments. As a position paper, our central claim is that AI evaluations are inferences that already presuppose some (often implicit) theory of capability, and that making this theory explicit enables better scrutiny and transparency. The empirical demonstration is intended to illustrate the practical consequences of differing assumptions rather than to validate any specific modeling framework as superior. We do not claim in the manuscript that a psychometrics-derived approach is empirically better than the listed alternatives; instead, we argue that current practices would benefit from explicit documentation of whatever theory is being used. We will revise the manuscript to more explicitly state the illustrative purpose of the experiment, to avoid any implication of validated superiority, and to identify comparative validation studies as an important direction for future work. This revision addresses the referee's concern by sharpening the framing without expanding the paper's scope beyond a position piece.
Revision: partial
Circularity Check
No circularity: conceptual argument and illustrative demo are independent of inputs
The paper advances a position that benchmark scores are inferences presupposing an implicit theory of capability, advocating explicit modeling drawn from psychometrics. This rests on logical analysis of evaluation practices rather than any derivation, equation, or fit. The proof-of-concept empirically illustrates that performance numbers vary with modeling assumptions; this is a direct consequence of changing the assumptions and does not reduce to a fitted parameter renamed as prediction or any self-referential construction. No self-citations are load-bearing, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled in. The Evaluation Card is a documentation template, not a derived result. The chain is self-contained against external benchmarks from psychometrics and standard evaluation critique.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Benchmark scores are inferences that already presuppose a theory of capability
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability... start from a theory of performance, and develop methods for inference from that theory."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "$\phi_i = \theta_i + s(x_i) + \epsilon_i$ ... Assumption 2... clustered bootstrapping... adaptive test based on item response theory"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.