Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Pith reviewed 2026-06-26 17:31 UTC · model grok-4.3
The pith
Apparent psychological profiles of large language models arise mostly from directional response bias rather than the traits the tests target.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a battery of self-report and behavioral instruments, the authors show that differences between LLMs are driven not by targeted traits but by directional response bias; variance decomposition attributes 81-90% of between-model variation to this bias. The bias declines with capability but is not eliminated. Instrument reliability is almost entirely predicted by response orthogonality, and profiles shift with item selection or can be manufactured through it.
What carries the argument
Directional response bias (tendency to respond toward one end of the scale regardless of item content), isolated through variance decomposition within a formal psychometric framework applied to self-report and behavioral tasks.
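To make the idea concrete, here is a minimal Python sketch of how a directional bias might be separated from a trait signal on a balanced scale. It is not the authors' estimator: the additive response model, the names `simulate` and `bias_share`, the 20-item scale, and every parameter value are assumptions chosen for illustration. Each simulated model answers with `bias + key * trait + noise`, where `key` is +1 or -1 depending on how the item is keyed.

```python
import random
from statistics import mean, pvariance

# Hypothetical balanced scale: 10 positively and 10 negatively keyed items.
KEYS = (+1, -1) * 10

def simulate(n_models, sd_bias, sd_trait, sd_noise=0.5, keys=KEYS, seed=0):
    """Simulate centered responses under an assumed additive model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_models):
        bias = rng.gauss(0, sd_bias)    # model-specific directional bias
        trait = rng.gauss(0, sd_trait)  # model-specific trait level
        data.append([bias + k * trait + rng.gauss(0, sd_noise) for k in keys])
    return data

def bias_share(data, keys=KEYS):
    """Share of between-model variance carried by the bias estimate."""
    # With balanced keying, the raw row mean estimates the bias and the
    # key-weighted row mean estimates the trait.
    bias_hat = [mean(row) for row in data]
    trait_hat = [mean([k * r for k, r in zip(keys, row)]) for row in data]
    var_bias, var_trait = pvariance(bias_hat), pvariance(trait_hat)
    return var_bias / (var_bias + var_trait)

bias_heavy = bias_share(simulate(56, sd_bias=1.0, sd_trait=0.35))
trait_heavy = bias_share(simulate(56, sd_bias=0.35, sd_trait=1.0))
print(f"bias-dominated population:  {bias_heavy:.2f}")
print(f"trait-dominated population: {trait_heavy:.2f}")
```

The split relies on the keys summing to zero. On an unbalanced scale the raw row mean mixes bias and trait, so this simple estimator would no longer separate them.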
If this is right
- Apparent reliability of any instrument for LLMs is almost entirely predicted by its response orthogonality.
- Model profiles shift with the items used and can be manufactured through item selection.
- The bias declines with model capability but is not eliminated by it.
- Instruments borrowed from human psychology rarely achieve full orthogonality and may lack validity for LLMs.
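Response orthogonality, as defined in the abstract, is the proportion of items for which trait and bias point in opposite directions. The sketch below computes that proportion for a single assumed bias direction. The +1/-1 encoding and the function name `response_orthogonality` are hypothetical; how the paper determines each model's bias direction, or handles behavioral tasks, is not described in this summary.

```python
def response_orthogonality(item_keys, bias_direction):
    """Proportion of items whose trait keying opposes the bias direction.

    item_keys: +1 if endorsing the item indicates more of the trait, else -1.
    bias_direction: +1 for a pull toward the "agree" end, -1 toward "disagree".
    """
    if not item_keys:
        raise ValueError("item_keys must not be empty")
    opposing = sum(1 for key in item_keys if key * bias_direction < 0)
    return opposing / len(item_keys)

all_positive = [+1] * 10   # every item keyed in the same direction
balanced = [+1, -1] * 5    # half the items reverse-keyed

print(response_orthogonality(all_positive, +1))  # 0.0
print(response_orthogonality(balanced, +1))      # 0.5
```

Under this encoding, a scale with no reverse-keyed items scores 0.0 against an agree-leaning bias, since trait and bias never pull in opposite directions.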
Where Pith is reading between the lines
- Dedicated LLM assessments should be built around response orthogonality rather than direct adoption of human scales.
- Using LLMs as proxies for human participants in psychological research may introduce systematic artifacts traceable to response bias.
- Similar directional biases could affect other survey-style evaluations of AI systems beyond personality and risk measures.
Load-bearing premise
Directional response bias can be cleanly separated from the targeted trait effects by the chosen instruments and the variance decomposition method.
What would settle it
A replication in which the same variance decomposition on new LLMs and instruments attributes most between-model variation to the intended traits rather than directional bias, or in which model profiles remain stable when items are swapped for others measuring the same trait.
Original abstract
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper administers a battery of personality and risk-preference instruments (self-report and behavioral) to 56 instruction-tuned LLMs and human reference samples. It reports four findings: (1) between-model differences are driven primarily by directional response bias (tendency toward one scale end or labeled option, independent of item content), with a variance decomposition attributing 81-90% of between-model variation to this bias versus 9-16% in humans; (2) the bias declines with model capability but is not eliminated; (3) instrument reliability is almost entirely predicted by 'response orthogonality' (proportion of items where trait and bias oppose); (4) apparent profiles shift with item selection and can be manufactured. The paper concludes that LLM psychological profiles are measurement artifacts, not model properties, and calls for new assessments centered on response orthogonality.
Significance. If the variance decomposition and orthogonality results hold under scrutiny, the work would substantially undermine the use of borrowed human psychological instruments for profiling LLMs in usability, safety, and research-proxy contexts. It supplies concrete empirical numbers from a large model sample and introduces a falsifiable construct (response orthogonality) that directly predicts reliability, offering a clear path for improved instrumentation.
major comments (2)
- [variance decomposition / findings 1] The central variance decomposition (abstract and § on findings) attributes 81-90% of between-model variation to directional bias under the assumption that bias is additive and orthogonal to trait effects. For LLMs this isolation is not obviously guaranteed, because next-token generation conditions on full item semantics and scale labels; any content-by-bias interaction would be misattributed to pure directional bias. Explicit residual diagnostics or simulation checks confirming orthogonality across the administered items are required to support the reported percentages.
- [response orthogonality definition and reliability prediction] Response orthogonality is defined and used to predict reliability (finding 3), yet the manuscript provides no direct test that the proportion of opposing items is independent of the specific trait scales or LLM prompting regime. If orthogonality itself covaries with item content or model capability, the predictive claim would be circular and the 81-90% bias attribution would be overstated.
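The first major comment can be illustrated with a small Monte Carlo sketch in Python. It is not the authors' analysis: the interaction form `gamma * bias * pull`, the item feature `PULL`, and all parameter values are assumptions. The sketch applies a purely additive estimator to data that contain a content-by-bias interaction and shows that, for this particular interaction, the extra variance lands in the bias component.

```python
import random
from statistics import mean, pvariance

KEYS = (+1, -1, +1, -1) * 5      # balanced keying, 20 hypothetical items
PULL = (0.5, 0.5, 1.5, 1.5) * 5  # hypothetical item feature, mean 1.0,
                                 # uncorrelated with the keying

def bias_share(gamma, n_models=56, seed=0):
    """Bias share from an additive estimator when the true data
    contain an interaction of strength gamma between bias and item content."""
    rng = random.Random(seed)
    bias_hat, trait_hat = [], []
    for _ in range(n_models):
        bias = rng.gauss(0, 1.0)
        trait = rng.gauss(0, 0.35)
        row = [bias + k * trait + gamma * bias * c + rng.gauss(0, 0.5)
               for k, c in zip(KEYS, PULL)]
        bias_hat.append(mean(row))                                  # raw mean
        trait_hat.append(mean([k * r for k, r in zip(KEYS, row)]))  # keyed mean
    var_bias, var_trait = pvariance(bias_hat), pvariance(trait_hat)
    return var_bias / (var_bias + var_trait)

for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma:.1f}  bias share={bias_share(gamma):.3f}")
```

Because `PULL` has a nonzero mean and is balanced across keys, the interaction term inflates the raw row mean and leaves the keyed mean essentially unchanged. Other interaction forms could leak into the trait estimate instead, which is why residual diagnostics on the actual items matter.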
minor comments (2)
- [methods] The abstract states four findings, but the methods describe only at a high level how bias was isolated from trait effects and how orthogonality was computed; a dedicated methods subsection with item lists, exact variance formulas, and human-LLM comparison tables would improve reproducibility.
- [finding 4] The claim that profiles 'can be manufactured through item selection' (finding 4) is important but would benefit from a supplementary table showing the exact item subsets and resulting profile shifts for at least two instruments.
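A toy Python example shows how item selection alone could manufacture a profile for a respondent with no trait signal. The 1-5 Likert scale, the six item keys, and the constant response of 4 are assumptions for illustration, not values from the paper.

```python
MIDPOINT = 3  # center of an assumed 1-5 Likert scale

def trait_score(responses, item_keys):
    """Mean item score after reverse-scoring negatively keyed items."""
    scored = [MIDPOINT + k * (r - MIDPOINT) for r, k in zip(responses, item_keys)]
    return sum(scored) / len(scored)

# A respondent who mildly agrees with everything, regardless of content.
keys = [+1, +1, +1, -1, -1, -1]
responses = [4, 4, 4, 4, 4, 4]

def score_subset(indices):
    """Score the respondent on a chosen subset of items."""
    return trait_score([responses[i] for i in indices],
                       [keys[i] for i in indices])

full = score_subset(range(6))   # all items
high = score_subset([0, 1, 2])  # positively keyed items only
low = score_subset([3, 4, 5])   # negatively keyed items only

print(full, high, low)  # 3.0 4.0 2.0
```

The same six answers yield a neutral, high, or low trait score depending only on which items are scored, which is the kind of subset-by-subset comparison the supplementary table would document.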
Simulated Author's Rebuttal
We thank the referee for these constructive comments on the assumptions underlying our variance decomposition and the independence of the response orthogonality measure. We respond to each point below.
- Referee: [variance decomposition / findings 1] The central variance decomposition (abstract and § on findings) attributes 81-90% of between-model variation to directional bias under the assumption that bias is additive and orthogonal to trait effects. For LLMs this isolation is not obviously guaranteed, because next-token generation conditions on full item semantics and scale labels; any content-by-bias interaction would be misattributed to pure directional bias. Explicit residual diagnostics or simulation checks confirming orthogonality across the administered items are required to support the reported percentages.
Authors: We agree that the reported variance decomposition assumes additivity and orthogonality of directional bias to trait effects, and that next-token prediction could in principle introduce content-by-bias interactions. Our decomposition was obtained by fitting a linear model that separates an item-specific trait direction term from a model-specific directional bias term; the large between-model variance component attributed to bias is therefore conditional on that specification. To strengthen the claim, the revised manuscript will include (i) residual plots and formal tests for remaining interactions from the fitted models and (ii) Monte Carlo simulations that inject controlled content-by-bias interactions and check whether the original attribution percentages are recovered. These additions will be placed in the methods and results sections alongside the existing decomposition. revision: yes
- Referee: [response orthogonality definition and reliability prediction] Response orthogonality is defined and used to predict reliability (finding 3), yet the manuscript provides no direct test that the proportion of opposing items is independent of the specific trait scales or LLM prompting regime. If orthogonality itself covaries with item content or model capability, the predictive claim would be circular and the 81-90% bias attribution would be overstated.
Authors: Response orthogonality was computed on the same set of instruments administered to all 56 models, which already span multiple trait domains and a wide capability range; the reported reliability prediction held uniformly across those instruments. Nevertheless, we accept that an explicit check for covariance between orthogonality and either model capability or item features would further rule out circularity. The revision will therefore report (a) the correlation between orthogonality scores and model capability and (b) the correlation between orthogonality and item-level semantic features (e.g., length, polarity). If any modest covariance is detected, we will note its magnitude and re-estimate the reliability prediction after residualizing orthogonality on those covariates. revision: partial
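The residualization step described in the second response could look roughly like the Python sketch below. The helper names `pearson` and `residualize`, the choice of mean item length as the covariate, and the six toy values per list are all hypothetical; none of the numbers come from the paper.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualize(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

# Arbitrary toy values, one entry per instrument; not data from the paper.
orthogonality = [0.10, 0.25, 0.40, 0.50, 0.30, 0.45]
item_length = [12.0, 9.0, 15.0, 11.0, 14.0, 8.0]
reliability = [0.92, 0.75, 0.55, 0.35, 0.70, 0.42]

orth_resid = residualize(orthogonality, item_length)

print("orthogonality vs covariate: ", pearson(orthogonality, item_length))
print("raw prediction:             ", pearson(orthogonality, reliability))
print("residualized prediction:    ", pearson(orth_resid, reliability))
```

If the residualized correlation stays close to the raw one, the covariate is unlikely to explain the reliability prediction. With several covariates, a multiple regression would replace the single-predictor fit used here.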
Circularity Check
No significant circularity: empirical variance decomposition on administered instruments
The paper's claims rest on direct administration of personality/risk instruments to 56 LLMs and human samples, followed by variance decomposition that partitions observed response variation into trait-targeted vs. directional bias components. The 81-90% bias attribution is an output of this decomposition applied to collected data, not a quantity defined in terms of itself or recovered by fitting parameters to the target result. 'Response orthogonality' is coined as a descriptive proportion of opposing items and then shown to empirically predict reliability; this is a post-hoc correlation on the data, not a self-definitional loop. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The analysis is self-contained against external benchmarks of instrument administration and standard psychometric variance partitioning.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Psychometric instruments designed for humans can be meaningfully administered to LLMs to decompose responses into trait and bias components
invented entities (1)
- response orthogonality: no independent evidence