Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Augustin Kelava; Florian E. Dorner; Olawale Salaudeen; Samira Samadi; Tom Sühr

arxiv: 2507.23009 · v2 · submitted 2025-07-30 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 02:09 UTC · model grok-4.3

keywords AI evaluation · benchmarks · psychometrics · large language models · measurement validity · ontological error · AI-specific tests

The pith

Human tests do not measure intelligence or other traits in AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors argue that standardized tests for intelligence, personality, and other traits in humans cannot be directly used on AI systems. These tests are developed based on theories of human cognition and validated through studies on human populations. Applying them to large language models without fresh validation for AI risks producing scores that do not correspond to the traits they claim to assess. The position paper therefore calls for halting this practice and instead building evaluation methods designed specifically around the properties of artificial systems. Such a change would help clarify what current benchmarks actually reveal about AI performance.

Core claim

The central claim is that LLM performance on human-designed tests lacks sufficient theoretical and empirical justification to be interpreted as a measurement of human-like traits. Human psychological and educational tests are theory-driven measurement instruments calibrated to a specific human population, so applying them directly to non-human subjects without new validation risks mischaracterizing what is being measured.

What carries the argument

The requirement that measurement instruments receive empirical validation when transferred from their original calibrated human population to non-human AI subjects.

If this is right

AI research should stop relying on unvalidated human tests such as IQ or personality inventories for evaluating models.
New evaluation frameworks must be built specifically for AI, either by adapting psychometric principles or by creating them from scratch.
Such frameworks would need to address known problems including data contamination, cultural bias, and sensitivity to prompt changes.
Clearer tests would reduce the risk of misinterpreting what benchmark scores actually indicate about AI capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

AI-specific tests could identify capabilities and failure modes that have no direct parallel in human trait categories.
This shift might produce more reliable assessments for AI safety by focusing on actual system behaviors instead of assumed human analogies.
Controlled experiments could check whether selected items from human tests show any transfer validity to AI under strict conditions.

Load-bearing premise

Human psychological and educational tests are valid measurement tools only for the human populations on which they were calibrated and cannot transfer directly to AI without separate validation studies.

What would settle it

An empirical study that applies a human test to AI systems, demonstrates through response patterns and external criteria that the scores measure the same underlying trait as in humans, and shows consistent validity across both populations.
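One small ingredient of such a study can be sketched in code. The Python below is an illustration, not something taken from the paper: it computes Cronbach's alpha, a standard internal-consistency coefficient, separately for a human sample and an LLM sample that answered the same items. The function name `cronbach_alpha` and both response matrices are hypothetical, and the numbers are arbitrary toy values rather than empirical results. Comparable alpha values would be at most a necessary condition; on their own they would not show that the same underlying trait is being measured in both populations.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("need at least two items and equal-length rows")
    # Variance of each item (column) across respondents.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of each respondent's summed score.
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Synthetic toy data (hypothetical, for illustration only):
# rows are respondents, columns are three Likert items scored 1-5.
human_sample = [[1, 2, 1], [3, 3, 3], [5, 4, 5], [2, 2, 3]]
llm_sample = [[1, 5, 3], [5, 1, 3], [3, 3, 1], [2, 4, 5]]

alpha_human = cronbach_alpha(human_sample)
alpha_llm = cronbach_alpha(llm_sample)
print(f"alpha (human toy data): {alpha_human:.3f}")
print(f"alpha (LLM toy data):   {alpha_llm:.3f}")
```

A fuller study along the lines described above would also need checks against external criteria and tests of measurement invariance, which this sketch does not attempt.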

Figures

Figures reproduced from arXiv: 2507.23009 by Augustin Kelava, Florian E. Dorner, Olawale Salaudeen, Samira Samadi, Tom Sühr.

**Figure 2.** Measurement models of two latent variables.

Original abstract

Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias, and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Human tests lack validity for AI, so we should build evaluations designed for machines instead.

This paper's main point is that we should stop treating AI performance on human psychological and educational tests as evidence of human-like traits. The authors say these tests are theory-driven instruments built and calibrated for people, so applying them to LLMs without new validation creates an ontological mismatch and risks misreading what the scores actually mean. They list familiar problems like data contamination, prompt sensitivity, and cultural bias as symptoms of this deeper issue and call for AI-specific frameworks instead, which could adapt psychometric methods or start from scratch.

Referee Report

1 major / 2 minor

Summary. The manuscript is a position paper arguing that interpreting LLM performance on human-designed psychological and educational tests (e.g., for intelligence or personality) as evidence of human-like traits constitutes an ontological error. These tests are theory-driven instruments calibrated to specific human populations; applying them directly to AI systems without empirical validation risks mischaracterizing what is measured. The authors cite known problems with current benchmarks (data contamination, cultural bias, prompt sensitivity) and call for developing principled AI-specific evaluation frameworks, which could either adapt psychometric methods or be built from scratch.

Significance. If the position is adopted, the paper could encourage the AI evaluation community to prioritize validity and population-specific calibration over anthropomorphic interpretations of benchmarks. This would support more accurate assessment of AI capabilities on their own terms and reduce overclaiming based on human-normed instruments.

major comments (1)

The central claim (abstract and opening sections) that human tests lack justification for AI use rests on construct-validity principles from psychometrics, but the manuscript provides no concrete illustration of how a specific test (e.g., an IQ or personality inventory) produces a mischaracterization when applied to an LLM; without such an example the argument remains at a high level of generality.

minor comments (2)

The term 'ontological error' is used without a brief definition or reference; adding one sentence of clarification would improve accessibility for readers outside measurement theory.
The call for new frameworks would be strengthened by citing at least two existing AI-specific evaluation proposals (e.g., from the literature on benchmark validity or agent testing) to show how the position relates to ongoing work.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. The feedback highlights an opportunity to make our position more concrete, and we address it directly below while committing to revisions that preserve the manuscript's scope as a position paper.

Referee: The central claim (abstract and opening sections) that human tests lack justification for AI use rests on construct-validity principles from psychometrics, but the manuscript provides no concrete illustration of how a specific test (e.g., an IQ or personality inventory) produces a mischaracterization when applied to an LLM; without such an example the argument remains at a high level of generality.

Authors: We agree that an illustrative example would improve accessibility without altering the core argument, which rests on established psychometric principles that tests must be validated for the target population and construct. In the revised manuscript we will add a brief, literature-based illustration in the introduction: applying a Big Five personality inventory to LLMs, where prompt variations produce unstable trait profiles despite high average scores, demonstrating that the instrument fails to measure a stable construct in AI systems. This draws on documented prompt-sensitivity results rather than new experiments, keeping the paper focused on the position.

Revision: yes
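The kind of instability described in this response can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' analysis: `score_spread` is a hypothetical helper, the prompt-variant labels are invented, and the trait scores are synthetic placeholder numbers rather than measured results. For each trait it reports the mean and the range of scores obtained when the same inventory is administered under different prompt wordings.

```python
from statistics import mean


def score_spread(scores_by_prompt):
    """Per-trait mean and range across prompt variants.

    scores_by_prompt maps a prompt-variant label to {trait: score}.
    Returns {trait: (mean, max - min)}.
    """
    traits = next(iter(scores_by_prompt.values())).keys()
    summary = {}
    for trait in traits:
        values = [scores[trait] for scores in scores_by_prompt.values()]
        summary[trait] = (mean(values), max(values) - min(values))
    return summary


# Synthetic placeholder scores on a 1-5 scale (not empirical results).
runs = {
    "original wording": {"extraversion": 4.0, "agreeableness": 4.5},
    "reordered options": {"extraversion": 2.0, "agreeableness": 4.5},
    "paraphrased items": {"extraversion": 3.0, "agreeableness": 4.0},
}

for trait, (avg, spread) in score_spread(runs).items():
    print(f"{trait}: mean={avg:.2f}, range={spread:.2f}")
```

A large range relative to the scale width is the pattern the response calls an unstable trait profile; what counts as "large" is a judgment this sketch leaves open.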

Circularity Check

0 steps flagged

No significant circularity detected

The paper is a position paper that advances a conceptual argument based on established principles of psychometrics and measurement theory (construct validity, population-specific calibration). It contains no derivations, equations, fitted parameters, predictions, or self-citations that reduce any claim to its own inputs by construction. The central position follows directly from standard domain knowledge without internal reduction or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The argument rests on standard assumptions from psychometrics and measurement theory rather than new postulates or fitted values.

axioms (1)

Domain assumption: Human psychological and educational tests are theory-driven measurement instruments calibrated to a specific human population.
Invoked directly in the abstract to ground the claim that applying these tests to AI without empirical validation constitutes an ontological error.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
cs.AI · 2026-04 · unverdicted · novelty 7.0

Crossed random-effects models on LLM word ratings show 16.9% variance from genuine stimulus-specific individuality, exceeding null models and forming coherent per-model fingerprints.

Position: Science of AI Evaluation Requires Item-level Benchmark Data
cs.AI · 2026-02 · unverdicted · novelty 5.0

Item-level benchmark data is essential for rigorous AI evaluation because it enables fine-grained diagnostics and principled validation of benchmarks that aggregate scores cannot provide.

[1] [1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Adebayo, J

J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps.Advances in neural information processing systems, 31, 2018

work page 2018

[3] [3]

American Educational Research Association, A. P. Association, and N. C. on Measurement in Education. Standards for educational and psychological testing, 2014

work page 2014

[4] [4]

Construct

American Psychological Association. Construct. https://dictionary.apa.org/ construct, n.d. Accessed: 2025-05-20

work page 2025

[5] [5]

Anthropic

A. Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1:1, 2024

work page 2024

[6] [6]

Baddeley

A. Baddeley. Psychology of learning and motivation.(No Title), 8:47, 1974

work page 1974

[7] [7]

Baddeley

A. Baddeley. Working memory.Current biology, 20(4):R136–R140, 2010

work page 2010

[8] [8]

Barocas, M

S. Barocas, M. Hardt, and A. Narayanan.Fairness and machine learning: Limitations and opportunities. MIT press, 2023

work page 2023

[9] [9]

K. A. Bollen.Structural equations with latent variables, volume 210. John Wiley & Sons, 1989

work page 1989

[10] [10]

Borsboom, G

D. Borsboom, G. J. Mellenbergh, and J. Van Heerden. The theoretical status of latent variables. Psychological review, 110(2):203, 2003

work page 2003

[11] [11]

Borsboom, G

D. Borsboom, G. J. Mellenbergh, and J. Van Heerden. The concept of validity.Psychological review, 111(4):1061, 2004

work page 2004

[12] [12]

C. J. Brainerd and J. Kingma. On the independence of short-term memory and working memory in cognitive development.Cognitive Psychology, 17(2):210–247, 1985

work page 1985

[13] [13]

Caron and S

G. Caron and S. Srivastava. Manipulating the perceived personality traits of language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2370–2386, 2023. 11

work page 2023

[14] [14]

From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors

M. Cheng, A. Y . Lee, K. Rapuano, K. Niederhoffer, A. Liebscher, and J. Hancock. From tools to thieves: Measuring and understanding public perceptions of ai through crowdsourced metaphors.arXiv preprint arXiv:2501.18045, 2025

[15] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018

[16] L. M. Collins and S. T. Lanza. Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. John Wiley & Sons, 2009

[17] N. Cowan. Evolving conceptions of memory storage, selective attention, and their mutual constraints within the human information-processing system. Psychological Bulletin, 104(2):163, 1988

[18] L. J. Cronbach and P. E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281, 1955

[19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

[20] R. Dominguez-Olmedo, M. Hardt, and C. Mendler-Dünner. Questioning the survey responses of large language models. arXiv preprint arXiv:2306.07951, 2023

[21] R. Dominguez-Olmedo, F. E. Dorner, and M. Hardt. Training on the test task confounds evaluation and emergence. arXiv preprint arXiv:2407.07890, 2024

[22] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012

[23] R. W. Engle, S. W. Tuholski, J. E. Laughlin, and A. R. Conway. Working memory, short-term memory, and general fluid intelligence: a latent-variable approach. Journal of Experimental Psychology: General, 128(3):309, 1999

[24] P. Feyerabend. Against method: Outline of an anarchistic theory of knowledge. Verso Books, 2020

[25] C. Fourrier, N. Habib, A. Lozovskaya, K. Szafer, and T. Wolf. Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024

[26] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records...

[27] A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, et al. Are we done with MMLU? arXiv preprint arXiv:2406.04127, 2024

[28] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[29] V. Gupta, D. Pantoja, C. Ross, A. Williams, and M. Ung. Changing answer order can decrease MMLU accuracy. arXiv preprint arXiv:2406.19470, 2024

[30] M. Hardt. The emerging science of machine learning benchmarks, 2025

[31] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016

[32] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[33] J. Huang and O. Li. Measuring the IQ of mainstream large language models in Chinese using the Wechsler Adult Intelligence Scale. Authorea Preprints, 2024

[34] J.-t. Huang, W. Wang, E. J. Li, M. H. Lam, S. Ren, Y. Yuan, W. Jiao, Z. Tu, and M. Lyu. On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs. In The Twelfth International Conference on Learning Representations, 2023

[35] G. Jiang, M. Xu, S.-C. Zhu, W. Han, C. Zhang, and Y. Zhu. Evaluating and inducing personality in pre-trained language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=I9xE1Jsjfx

[36] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[37] D. M. Katz, M. J. Bommarito, S. Gao, and P. Arredondo. GPT-4 passes the bar exam. Philosophical Transactions of the Royal Society A, 382(2270):20230254, 2024

[38] M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch, C. Anastasiou, S. V. Mehta, L. K. Jain, V. Aglietti, D. Jindal, P. Chen, et al. BIG-Bench extra hard. arXiv preprint arXiv:2502.19187, 2025

[39] A. Kipnis, K. Voudouris, L. M. S. Buschoff, and E. Schulz. metabench: a sparse benchmark to measure general ability in large language models. arXiv preprint arXiv:2407.12844, 2024

[40] E. Lawrie. Can AI therapists really be an alternative to human help?, 2025. URL https://www.bbc.com/news/articles/ced2ywg7246o. Accessed: 2025-05-22

[41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

[42] X. Li, Y. Li, L. Liu, L. Bing, and S. Joty. Is GPT-3 a psychopath? Evaluating large language models from a psychological perspective. arXiv preprint arXiv:2212.10529, 2022

[43] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

[44] R. W. Lissitz and K. Samuelsen. A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36(8):437–448, 2007

[45] R. T. McCoy, S. Yao, D. Friedman, M. Hardy, and T. L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023

[46] P. E. Meehl. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. 1992

[47] Q. Mei, Y. Xie, W. Yuan, and M. O. Jackson. A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences, 121(9):e2313925121, 2024

[48] W. Meredith. Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4):525–543, 1993. doi: 10.1007/BF02294825. URL https://doi.org/10.1007/BF02294825

[49] S. Merken. Judge fines lawyers in Walmart lawsuit over fake, AI-generated cases. Reuters, 2025. URL https://www.reuters.com/legal/government/judge-fines-lawyers-walmart-lawsuit-over-fake-ai-generated-cases-2025-02-25/

[50] J. Miller, K. Krauth, B. Recht, and L. Schmidt. The effect of natural distribution shift on question answering models. In International Conference on Machine Learning, pages 6905–6916. PMLR, 2020

[51] J. P. Miller, R. Taori, A. Raghunathan, S. Sagawa, P. W. Koh, V. Shankar, P. Liang, Y. Carmon, and L. Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pages 7721–7735. PMLR, 2021

[52] M. Miller. Chatbots are being sued for refusing to answer questions. Is that a free speech issue?, Apr. 2024. URL https://mashable.com/article/chatbots-lawsuit-free-speech. Mashable. Accessed: 2025-05-13

[53] R. E. Nisbett, J. Aronson, C. Blair, W. Dickens, J. Flynn, D. F. Halpern, and E. Turkheimer. Intelligence: new findings and theoretical developments. American Psychologist, 67(2):130, 2012

[54] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774

[55] OpenAI. GPT-4.1 overview. https://openai.com/index/gpt-4-1/, 2024. Accessed: 2025-04-14

[56] OpenAI. Introducing the OpenAI Academy, September 2024. URL https://openai.com/global-affairs/openai-academy/. Accessed: 2025-04-13

[57] J. Pearl. Causality. Cambridge University Press, 2009

[58] K. Popper. Logik der Forschung. Akademie Verlag, 2013

[59] I. Rahwan, M. Cebrian, N. Obradovich, J. Bongard, J.-F. Bonnefon, C. Breazeal, J. W. Crandall, N. A. Christakis, I. D. Couzin, M. O. Jackson, et al. Machine behaviour. Nature, 568(7753):477–486, 2019

[60] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019

[61] W. R. Revelle. psych: Procedures for personality and psychological research. 2017

[62] K. Roose. Can A.I. be blamed for a teen’s suicide?, Oct. 2024. URL https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html. The New York Times. Accessed: 2025-05-13

[63] Y. Ruan, C. J. Maddison, and T. Hashimoto. Observational scaling laws and the predictability of language model performance, 2024. URL https://arxiv.org/abs/2405.10938

[64] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019

[65] O. Salaudeen and M. Hardt. ImageNot: A contrast with ImageNet preserves model rankings. arXiv preprint arXiv:2404.02112, 2024

[66] O. Salaudeen, N. Chiou, S. Weng, and S. Koyejo. Are domain generalization benchmarks with accuracy on the line misspecified? arXiv preprint arXiv:2504.00186, 2025

[67] O. Salaudeen, A. Reuel, A. Ahmed, S. Bedi, Z. Robertson, S. Sundar, B. Domingue, A. Wang, and S. Koyejo. Measurement to meaning: A validity-centered framework for AI evaluation. Preprint, 2025

[68] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto. Whose opinions do language models reflect? arXiv preprint arXiv:2303.17548, 2023

[69] Z. Shi, N. Carlini, A. Balashankar, L. Schmidt, C.-J. Hsieh, A. Beutel, and Y. Qin. Effective robustness against natural distribution shifts for models with different training data. Advances in Neural Information Processing Systems, 36:73543–73558, 2023

[70] S. Singh, A. Romanou, C. Fourrier, D. I. Adelani, J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat, K. Marchisio, W. Q. Leong, Y. Susanto, et al. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. arXiv preprint arXiv:2412.03304, 2024

[71] C. J. Soto and O. P. John. The next Big Five Inventory (BFI-2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of Personality and Social Psychology, 113(1):117, 2017

[72] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

[73] J. W. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, et al. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285–1295, 2024

[74] M. Strevens. The knowledge machine: How an unreasonable idea created modern science. Penguin UK, 2020

[75] T. Sühr, F. E. Dorner, S. Samadi, and A. Kelava. Challenging the validity of personality tests for large language models. Preprint at arXiv. arXiv-2311 https://doi.org/10.48550/arXiv, 2311, 2023

[76] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

[77] D. Teney, Y. Lin, S. J. Oh, and E. Abbasnejad. ID and OOD performance are sometimes inversely correlated on real-world datasets. Advances in Neural Information Processing Systems, 36:71703–71722, 2023

[78] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

[79] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

[80] E. Wasilewski and M. Jablonski. Measuring the perceived IQ of multimodal large language models using standardized IQ tests. Authorea Preprints, 2024
