Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Pith reviewed 2026-05-19 02:09 UTC · model grok-4.3
The pith
Scores on human tests are not validated measures of intelligence or other traits in AI systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that benchmark performance on human-designed tests for LLMs lacks sufficient theoretical and empirical justification to support interpretations as measurements of human-like traits. Human psychological and educational tests are theory-driven measurement instruments calibrated to a specific human population, so their direct application to non-human subjects without new validation risks mischaracterizing what is being measured.
What carries the argument
The requirement that measurement instruments receive empirical validation when transferred from their original calibrated human population to non-human AI subjects.
If this is right
- AI research should stop relying on unvalidated human tests such as IQ or personality inventories for evaluating models.
- New evaluation frameworks must be built specifically for AI, either by adapting psychometric principles or by creating them from scratch.
- Such frameworks would need to address known problems including data contamination, cultural bias, and sensitivity to prompt changes.
- Clearer tests would reduce the risk of misinterpreting what benchmark scores actually indicate about AI capabilities.
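One of the problems listed above, sensitivity to prompt changes, can be probed with a simple consistency check. The sketch below is illustrative only: `choose` is a hypothetical stand-in for a model call, and the two toy "models" exist purely to exercise the function. It presents the same multiple-choice item under every ordering of its options and reports how often the most frequent answer is selected.

```python
import itertools
from typing import Callable, Sequence


def order_consistency(
    question: str,
    options: Sequence[str],
    choose: Callable[[str, Sequence[str]], int],
) -> float:
    """Fraction of option orderings on which the modal answer is chosen.

    `choose` is a hypothetical stand-in for a model call: it receives the
    question and the options in presented order and returns the index of
    the option it picks.
    """
    picks = []
    for perm in itertools.permutations(options):
        idx = choose(question, perm)
        picks.append(perm[idx])  # record the option text, not its position
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)


# Two toy "models" used only to exercise the function (not real systems).
def content_based(question: str, options: Sequence[str]) -> int:
    return list(options).index("4")  # always locates the same option text


def position_biased(question: str, options: Sequence[str]) -> int:
    return 0  # always picks whatever is shown first


opts = ["3", "4", "5"]
stable = order_consistency("What is 2 + 2?", opts, content_based)
biased = order_consistency("What is 2 + 2?", opts, position_biased)
print(stable, biased)
```

A value near 1.0 means the answer tracks option content; a value near 1/len(options) suggests the answer tracks position. Enumerating all permutations is only practical for a handful of options, so a real check would likely sample orderings instead.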
Where Pith is reading between the lines
- AI-specific tests could identify capabilities and failure modes that have no direct parallel in human trait categories.
- This shift might produce more reliable assessments for AI safety by focusing on actual system behaviors instead of assumed human analogies.
- Controlled experiments could check whether selected items from human tests show any transfer validity to AI under strict conditions.
Load-bearing premise
Human psychological and educational tests are valid measurement tools only for the human populations on which they were calibrated and cannot transfer directly to AI without separate validation studies.
What would settle it
An empirical study that applies a human test to AI systems, demonstrates through response patterns and external criteria that the scores measure the same underlying trait as in humans, and shows consistent validity across both populations.
Original abstract
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a position paper arguing that interpreting LLM performance on human-designed psychological and educational tests (e.g., for intelligence or personality) as evidence of human-like traits constitutes an ontological error. These tests are theory-driven instruments calibrated to specific human populations; applying them directly to AI systems without empirical validation risks mischaracterizing what is measured. The authors cite known problems with current benchmarks (data contamination, cultural bias, prompt sensitivity) and call for developing principled AI-specific evaluation frameworks, which could either adapt psychometric methods or be built from scratch.
Significance. If the position is adopted, the paper could encourage the AI evaluation community to prioritize validity and population-specific calibration over anthropomorphic interpretations of benchmarks. This would support more accurate assessment of AI capabilities on their own terms and reduce overclaiming based on human-normed instruments.
major comments (1)
- The central claim (abstract and opening sections) that human tests lack justification for AI use rests on construct-validity principles from psychometrics, but the manuscript provides no concrete illustration of how a specific test (e.g., an IQ or personality inventory) produces a mischaracterization when applied to an LLM; without such an example the argument remains at a high level of generality.
minor comments (2)
- The term 'ontological error' is used without a brief definition or reference; adding one sentence of clarification would improve accessibility for readers outside measurement theory.
- The call for new frameworks would be strengthened by citing at least two existing AI-specific evaluation proposals (e.g., from the literature on benchmark validity or agent testing) to show how the position relates to ongoing work.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation for minor revision. The feedback highlights an opportunity to make our position more concrete, and we address it directly below while committing to revisions that preserve the manuscript's scope as a position paper.
Point-by-point responses
- Referee: The central claim (abstract and opening sections) that human tests lack justification for AI use rests on construct-validity principles from psychometrics, but the manuscript provides no concrete illustration of how a specific test (e.g., an IQ or personality inventory) produces a mischaracterization when applied to an LLM; without such an example the argument remains at a high level of generality.
  Authors: We agree that an illustrative example would improve accessibility without altering the core argument, which rests on established psychometric principles that tests must be validated for the target population and construct. In the revised manuscript we will add a brief, literature-based illustration in the introduction: applying a Big Five personality inventory to LLMs, where prompt variations produce unstable trait profiles despite high average scores, demonstrating that the instrument fails to measure a stable construct in AI systems. This draws on documented prompt-sensitivity results rather than new experiments, keeping the paper focused on the position.
  Revision: yes
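The instability described in the authors' response can be made concrete with a small scoring sketch. Everything here is an assumption for illustration: the responses are made-up 1-5 Likert values, not results from any study, and the item keying, variant names, and `trait_score` helper are hypothetical. The sketch scores the same four items under three prompt paraphrases and reports how far the resulting trait score moves.

```python
from statistics import mean, pstdev
from typing import Sequence

# Hypothetical 1-5 Likert responses to the same four items of one trait
# scale under three prompt paraphrases. Values are invented for illustration.
REVERSE_KEYED = {1, 3}  # item indices scored as 6 - response (assumed keying)
responses_by_prompt = {
    "variant_a": [5, 1, 4, 2],
    "variant_b": [3, 3, 3, 3],
    "variant_c": [1, 5, 2, 4],
}


def trait_score(responses: Sequence[int]) -> float:
    """Mean item score after flipping the reverse-keyed items."""
    scored = [6 - r if i in REVERSE_KEYED else r for i, r in enumerate(responses)]
    return mean(scored)


# One trait score per prompt variant, then the spread across variants.
scores = {name: trait_score(r) for name, r in responses_by_prompt.items()}
spread = pstdev(list(scores.values()))
print(scores, spread)
```

If an instrument measured a stable construct, the spread across paraphrases would be small relative to the 1-5 scale. How small is small enough is a validation question that this sketch does not answer.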
Circularity Check
No significant circularity detected
The paper is a position paper that advances a conceptual argument based on established principles of psychometrics and measurement theory (construct validity, population-specific calibration). It contains no derivations, equations, fitted parameters, predictions, or self-citations that reduce any claim to its own inputs by construction. The central position follows directly from standard domain knowledge without internal reduction or load-bearing self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human psychological and educational tests are theory-driven measurement instruments calibrated to a specific human population.
Forward citations
Cited by 2 Pith papers
- Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
  Crossed random-effects models on LLM word ratings show 16.9% variance from genuine stimulus-specific individuality, exceeding null models and forming coherent per-model fingerprints.
- Position: Science of AI Evaluation Requires Item-level Benchmark Data
  Item-level benchmark data is essential for rigorous AI evaluation because it enables fine-grained diagnostics and principled validation of benchmarks that aggregate scores cannot provide.