
arXiv: 2507.23009 · v2 · submitted 2025-07-30 · 💻 cs.LG · cs.AI

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Pith reviewed 2026-05-19 02:09 UTC · model grok-4.3

keywords: AI evaluation · benchmarks · psychometrics · large language models · measurement validity · ontological error · AI-specific tests

The pith

Human tests do not measure intelligence or other traits in AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors argue that standardized tests for intelligence, personality, and other traits in humans cannot be directly used on AI systems. These tests are developed based on theories of human cognition and validated through studies on human populations. Applying them to large language models without fresh validation for AI risks producing scores that do not correspond to the traits they claim to assess. The position paper therefore calls for halting this practice and instead building evaluation methods designed specifically around the properties of artificial systems. Such a change would help clarify what current benchmarks actually reveal about AI performance.

Core claim

The central claim is that benchmark performance on human-designed tests for LLMs lacks sufficient theoretical and empirical justification to support interpretations as measurements of human-like traits. Human psychological and educational tests are theory-driven measurement instruments calibrated to a specific human population, so their direct application to non-human subjects without new validation risks mischaracterizing what is being measured.

What carries the argument

The requirement that measurement instruments receive empirical validation when transferred from their original calibrated human population to non-human AI subjects.

If this is right

  • AI research should stop relying on unvalidated human tests such as IQ or personality inventories for evaluating models.
  • New evaluation frameworks must be built specifically for AI, either by adapting psychometric principles or by creating them from scratch.
  • Such frameworks would need to address known problems including data contamination, cultural bias, and sensitivity to prompt changes.
  • Clearer tests would reduce the risk of misinterpreting what benchmark scores actually indicate about AI capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI-specific tests could identify capabilities and failure modes that have no direct parallel in human trait categories.
  • This shift might produce more reliable assessments for AI safety by focusing on actual system behaviors instead of assumed human analogies.
  • Controlled experiments could check whether selected items from human tests show any transfer validity to AI under strict conditions.

Load-bearing premise

Human psychological and educational tests are valid measurement tools only for the human populations on which they were calibrated and cannot transfer directly to AI without separate validation studies.
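
One standard way to state this premise formally is the linear factor model from psychometrics, together with the measurement-invariance condition across groups. The block below is a sketch for orientation, not the paper's own notation; every symbol in it is an assumption introduced here for illustration.

```latex
% x_{ig}: item responses of respondent i in group g
% \eta_{ig}: latent trait, \tau_g: item intercepts,
% \Lambda_g: factor loadings, \varepsilon_{ig}: residuals
x_{ig} = \tau_g + \Lambda_g \eta_{ig} + \varepsilon_{ig},
\qquad g \in \{\mathrm{human}, \mathrm{AI}\}

% Scores are comparable across the two groups only if loadings
% and intercepts coincide (scalar measurement invariance):
\Lambda_{\mathrm{human}} = \Lambda_{\mathrm{AI}},
\qquad
\tau_{\mathrm{human}} = \tau_{\mathrm{AI}}
```

Read this way, the premise says that the equalities in the second display cannot be assumed for AI systems and would have to be established by a separate validation study.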

What would settle it

An empirical study that applies a human test to AI systems, demonstrates through response patterns and external criteria that the scores measure the same underlying trait as in humans, and shows consistent validity across both populations.
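
One small ingredient of such a study can be sketched in code: checking whether a test's items hang together in the new population at all. The example computes Cronbach's alpha, a common internal-consistency statistic, on two response matrices. The data are hypothetical, the function name `cronbach_alpha` is ours, and treating each row as one "respondent" (for instance, one model or one sampled run) is an assumption. High internal consistency is at best a necessary condition; it does not by itself show that the same trait is being measured.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total (sum) score across respondents.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-respondent, 3-item matrices on a 1-5 scale.
consistent = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4]]
inconsistent = [[4, 1, 5], [2, 5, 1], [5, 2, 3], [1, 4, 4], [3, 3, 2]]

alpha_consistent = cronbach_alpha(consistent)      # items move together
alpha_inconsistent = cronbach_alpha(inconsistent)  # items do not move together
```

In the second matrix the items vary widely but the totals barely change, so alpha turns negative: the sum score is not summarising a single underlying quantity for those respondents.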

Figures

Figures reproduced from arXiv: 2507.23009 by Augustin Kelava, Florian E. Dorner, Olawale Salaudeen, Samira Samadi, Tom Sühr.

Figure 1: Overview of the measurement instrument creation process using a personality test as an example.
Figure 2: Measurement models of two latent variables.
Original abstract

Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias, and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a position paper arguing that interpreting LLM performance on human-designed psychological and educational tests (e.g., for intelligence or personality) as evidence of human-like traits constitutes an ontological error. These tests are theory-driven instruments calibrated to specific human populations; applying them directly to AI systems without empirical validation risks mischaracterizing what is measured. The authors cite known problems with current benchmarks (data contamination, cultural bias, prompt sensitivity) and call for developing principled AI-specific evaluation frameworks, which could either adapt psychometric methods or be built from scratch.

Significance. If the position is adopted, the paper could encourage the AI evaluation community to prioritize validity and population-specific calibration over anthropomorphic interpretations of benchmarks. This would support more accurate assessment of AI capabilities on their own terms and reduce overclaiming based on human-normed instruments.

major comments (1)
  1. The central claim (abstract and opening sections) that human tests lack justification for AI use rests on construct-validity principles from psychometrics, but the manuscript provides no concrete illustration of how a specific test (e.g., an IQ or personality inventory) produces a mischaracterization when applied to an LLM; without such an example the argument remains at a high level of generality.
minor comments (2)
  1. The term 'ontological error' is used without a brief definition or reference; adding one sentence of clarification would improve accessibility for readers outside measurement theory.
  2. The call for new frameworks would be strengthened by citing at least two existing AI-specific evaluation proposals (e.g., from the literature on benchmark validity or agent testing) to show how the position relates to ongoing work.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. The feedback highlights an opportunity to make our position more concrete, and we address it directly below while committing to revisions that preserve the manuscript's scope as a position paper.

Point-by-point responses
  1. Referee: The central claim (abstract and opening sections) that human tests lack justification for AI use rests on construct-validity principles from psychometrics, but the manuscript provides no concrete illustration of how a specific test (e.g., an IQ or personality inventory) produces a mischaracterization when applied to an LLM; without such an example the argument remains at a high level of generality.

    Authors: We agree that an illustrative example would improve accessibility without altering the core argument, which rests on established psychometric principles that tests must be validated for the target population and construct. In the revised manuscript we will add a brief, literature-based illustration in the introduction: applying a Big Five personality inventory to LLMs, where prompt variations produce unstable trait profiles despite high average scores, demonstrating that the instrument fails to measure a stable construct in AI systems. This draws on documented prompt-sensitivity results rather than new experiments, keeping the paper focused on the position.

    Revision: yes
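
The kind of instability the rebuttal describes can be made concrete with a short sketch. It takes trait scores obtained under several prompt variants and flags traits whose scores spread too widely across those variants. All numbers are hypothetical, and the function name `trait_stability`, the variant labels, and the `max_sd` threshold are assumptions made for illustration, not values from the paper or from any cited study.

```python
from statistics import mean, pstdev


def trait_stability(scores_by_prompt, max_sd=0.5):
    """Summarise each trait's spread across prompt variants.

    scores_by_prompt maps a prompt-variant label to a {trait: score} dict.
    A trait counts as stable when its standard deviation across variants
    is at most max_sd (an arbitrary illustrative threshold).
    """
    traits = sorted(next(iter(scores_by_prompt.values())))
    report = {}
    for trait in traits:
        values = [scores[trait] for scores in scores_by_prompt.values()]
        sd = pstdev(values)
        report[trait] = {"mean": mean(values), "sd": sd, "stable": sd <= max_sd}
    return report


# Hypothetical scores on a 1-5 scale for one model under three prompt variants.
scores = {
    "original": {"extraversion": 3.8, "agreeableness": 4.2},
    "reordered_options": {"extraversion": 2.1, "agreeableness": 4.1},
    "paraphrased": {"extraversion": 4.6, "agreeableness": 4.3},
}

report = trait_stability(scores)
```

With these made-up numbers, extraversion has a respectable mean but swings by more than a scale point between variants, so it is flagged as unstable, while agreeableness is not. A real analysis would need many more variants and a principled threshold.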

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a position paper that advances a conceptual argument based on established principles of psychometrics and measurement theory (construct validity, population-specific calibration). It contains no derivations, equations, fitted parameters, predictions, or self-citations that reduce any claim to its own inputs by construction. The central position follows directly from standard domain knowledge without internal reduction or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The argument rests on standard assumptions from psychometrics and measurement theory rather than new postulates or fitted values.

axioms (1)
  • domain assumption: Human psychological and educational tests are theory-driven measurement instruments calibrated to a specific human population.
    Invoked directly in the abstract to ground the claim that application to AI constitutes an ontological error without empirical validation.

pith-pipeline@v0.9.0 · 5749 in / 1218 out tokens · 28411 ms · 2026-05-19T02:09:59.855838+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Machine individuality: Separating genuine idiosyncrasy from response bias in large language models

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    Crossed random-effects models on LLM word ratings show 16.9% variance from genuine stimulus-specific individuality, exceeding null models and forming coherent per-model fingerprints.

  2. Position: Science of AI Evaluation Requires Item-level Benchmark Data

    cs.AI · 2026-02 · unverdicted · novelty 5.0

    Item-level benchmark data is essential for rigorous AI evaluation because it enables fine-grained diagnostics and principled validation of benchmarks that aggregate scores cannot provide.
