arxiv: 2606.09843 · v2 · pith:NQXHZ64Knew · submitted 2026-04-24 · 💻 cs.HC · cs.AI · cs.CL

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

Pith reviewed 2026-07-04 19:31 UTC · model glm-5.2

classification 💻 cs.HC · cs.AI · cs.CL
keywords LLM psychometrics · self-report behavior gap · LLM-as-judge · factor analysis · alignment · textual-surface bias · predictive validity

The pith

LLM self-reports don't predict behavior, even with native constructs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When LLMs answer personality questionnaires, their responses are stable and internally coherent. But those self-descriptions do not predict how the models actually behave in open-ended tasks. This paper tests whether that gap is an artifact of forcing human personality categories onto LLMs. The author built the first psychometric instrument whose dimensions are derived bottom-up from LLM behavior rather than borrowed from human psychology. Administering 300 items to 25 LLMs across 17 model families, exploratory factor analysis revealed five replicable, highly reliable factors: Responsiveness, Deference, Boldness, Guardedness, and Verbosity. The author then collected 2,500 open-ended behavioral samples and had them rated by 151 humans and a three-judge LLM ensemble. Humans and judges agreed about model behavior (mean r = .51), but self-report predicted neither: the gap persists even for constructs native to LLMs, where a human-mismatch explanation no longer applies. The one exception is Verbosity, which showed a weak but directionally consistent convergence with human ratings. A second finding concerns the LLM-as-judge method itself. On Responsiveness, self-report tracked LLM judges (r = .53) but not humans (r = .04), even though humans and judges otherwise agreed (r = .59). This pattern is mathematically incompatible with a single latent construct driving all three measurements, and forces a dual-loading account: judges and self-report items share textual-surface variance—cues of helpfulness, structure, and enthusiasm—that human observers do not weight as heavily. This confound is invisible to the within-ensemble reliability checks used to validate LLM judges.

Core claim

The central discovery is a dissociation between what LLMs say about themselves and what they do, demonstrated for the first time using constructs derived from LLM behavior rather than from human psychology. Five stable, replicable self-report factors emerged from bottom-up factor analysis, but these factors failed to predict how human raters perceived the same models' open-ended behavior, with the weak exception of Verbosity. A secondary discovery is a specific shared bias between LLM self-report items and LLM judges: both draw on textual-surface signals (structured formatting, enthusiastic framing) that human raters do not weight as heavily. This bias is undetectable by standard inter-judge

What carries the argument

The paper's central mechanism is a three-way comparison: self-report factor scores, human behavioral ratings, and LLM-judge behavioral ratings. The key diagnostic is the Responsiveness dissociation, where self-report correlates with LLM judges (r = .53) but not humans (r = .04), while humans and judges agree (r = .59). The author proves this pattern is incompatible with a single latent construct by showing the observed near-zero correlation falls below the lower bound implied by the product of the other two correlations (r ≈ .31). This forces a dual-loading account: self-report items and LLM judges share a source of variance (textual-surface cues) that human observers do not.
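
The bound mentioned above can be written out explicitly. The following is a minimal derivation, assuming standardized measures S (self-report), H (human ratings), and J (judge ratings) that each load on one common latent factor with same-sign loadings and mutually uncorrelated residuals. The loading symbols \lambda_S, \lambda_H, \lambda_J are notation introduced here for illustration and are not taken from the paper.

```latex
\begin{align*}
r_{SH} &= \lambda_S \lambda_H, \qquad
r_{SJ} = \lambda_S \lambda_J, \qquad
r_{HJ} = \lambda_H \lambda_J, \\
r_{SJ}\, r_{HJ} &= \lambda_S \lambda_H \lambda_J^{2}
  \;\le\; \lambda_S \lambda_H = r_{SH}
  \quad (\text{since } \lambda_J^{2} \le 1 \text{ and } \lambda_S \lambda_H \ge 0), \\
r_{SJ}\, r_{HJ} &= .53 \times .59 \approx .31 \;>\; .04 = r_{SH} \quad \text{(observed point estimates)}.
\end{align*}
```

The inequality is a statement about population correlations under the single-factor model. By itself it says nothing about how much sampling error surrounds the three observed estimates.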

Load-bearing premise

The behavioral prompt set used to test predictive validity is small: 20 prompts (4 per factor) across 25 models. The paper itself acknowledges this limitation. When the analysis is restricted to the four prompts designed for each factor, even the weak Verbosity signal disappears, suggesting the observed convergence may be carried by broad aggregation rather than by prompts that adequately sample each factor's behavioral space.

What would settle it

Administer the instrument to 50+ models with a larger behavioral prompt battery (e.g., 20+ prompts per factor). If self-report scores reliably predict human behavioral ratings for most or all factors, the self-report-behavior gap would be substantially narrowed or closed.
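
To see why a larger model pool matters, one can sketch how the width of a correlation's confidence interval shrinks with N. The example below uses the standard Fisher z approximation, which is an assumption made here for illustration; the paper's own intervals may have been computed differently (for example by bootstrap), so these numbers are not a reproduction of its tables. The function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Interval around a true null (r = 0) for several pool sizes.
for n in (25, 50, 100):
    lo, hi = fisher_ci(0.0, n)
    print(f"N={n:3d}: CI for r=0 is [{lo:+.2f}, {hi:+.2f}]")
```

Under this approximation, an observed correlation near zero at N=25 is compatible with moderate positive or negative population values, which is why the proposed replication asks for more models as well as more prompts.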

Figures

Figures reproduced from arXiv: 2606.09843 by Juan Manuel Contreras.

Figure 1: Self-report profiles of a nine-model subset across the five AI-native factors (z-scores...
Figure 2: Self-report vs. human-rater profiles across the five factors, for all 25 models. Each row...
Figure 3: Multitrait-multimethod matrix: Pearson correlations between self-report factor scores...
Figure 4: Big Five (BFI-44) profiles across the 25 models, shown as pool-relative z-scores. Extraver...
Figure 5: Self-report profiles for all 25 models across the five AI-native factors (z-scores relative to...
Figure 6: Big Five profiles for all 25 models, z-scored within the pool. Extraversion is scored from forward-keyed items only (E_fwd) due to acquiescence on reverse-keyed E items (§4.5).
Original abstract

Large language models (LLMs) give stable answers to personality questionnaires, yet these self-reports fail to predict how the models actually behave. Is this gap an artifact of forcing human trait categories onto LLMs, or something deeper about LLM self-report itself? To find out, we built the first psychometric instrument whose dimensions are derived bottom-up from LLM behavior rather than borrowed from human psychology. Administering 300 items (240 Likert + 60 scenario) to 25 LLMs across 17 model families, 30 times each, exploratory factor analysis revealed five replicable, highly reliable factors: Responsiveness, Deference, Boldness, Guardedness, and Verbosity (all Tucker $\phi \geq .957$, all $\alpha \geq .930$). We then collected 2,500 open-ended behavioral samples and had them rated by 151 humans and a three-judge LLM ensemble. Humans and judges agreed about model behavior ($\bar{r} = .51$), but self-report predicted neither: the gap persists even for constructs native to LLMs, where a human-mismatch explanation no longer applies. The exception is telling. On Responsiveness, self-report tracked LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges otherwise agreed ($r = .59$). Self-report items and LLM judges share a source of variance that human observers do not. This confound is invisible to the within-ensemble reliability checks used to validate LLM judges, and it poses a concrete risk for the LLM-as-judge pipelines now central to model evaluation. We release the instrument as a diagnostic probe for alignment-shaped self-description.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 8 minor

Summary. This paper presents the first LLM-native psychometric instrument: 300 self-report items derived bottom-up from LLM behavioral affordances (rather than imported from human personality taxonomies), administered to 25 LLMs across 17 model families, 30 times each. Exploratory factor analysis on a preregistered split-half design yields five replicable factors (Responsiveness, Deference, Boldness, Guardedness, Verbosity; all Tucker φ ≥ .957, all α ≥ .930). The central predictive-validity finding is that self-report factor scores do not reliably predict how human raters (N=151) or an LLM-judge ensemble rate the same models' open-ended behavior (mean Instrument–Human r = .15, no factor-level CI excluding zero at N=25). The paper's most emphasized secondary finding is a Responsiveness dissociation: self-report correlates with LLM judges (r = .53) but not humans (r = .04), while humans and judges agree (r = .59), which the paper interprets as evidence of a shared textual-surface bias between LLM self-report and LLM judges. The instrument, raw data, and code are released.

Significance. The paper makes a genuine methodological contribution by constructing an LLM-native instrument via bottom-up factor analysis and validating it against human behavioral ratings—addressing a gap explicitly identified in recent surveys. The factor-structure analysis is well-executed: the preregistered split-half design, observation weighting, Tucker congruence checks, and the model-level robustness check (Appendix K, φ ≥ .990) are appropriate safeguards for the unconventional N=25, 240-item EFA. The release of the 100-item instrument, scoring rules, raw response data, and analysis code is a concrete strength. The finding that internal psychometric coherence does not translate to behavioral predictive validity—even when constructs are LLM-native—is important for the growing 'AI psychometrics' literature. However, the paper's most actionable claim about LLM-as-judge bias (the Responsiveness dissociation) is overstated relative to what the statistical evidence supports at N=25, as detailed below.
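
For readers unfamiliar with the Tucker congruence coefficient cited above, here is a minimal sketch of how it is computed for one factor's loading vector in two solutions. The loading values are hypothetical and are not taken from the paper's data; the function name `tucker_phi` is likewise an illustrative choice.

```python
import math
from typing import Sequence


def tucker_phi(x: Sequence[float], y: Sequence[float]) -> float:
    """Tucker's congruence coefficient: sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if den == 0:
        raise ValueError("a loading vector is all zeros")
    return num / den


# Hypothetical loadings of six items on the "same" factor in two split halves.
half_a = [0.72, 0.65, 0.58, 0.10, 0.05, -0.02]
half_b = [0.70, 0.68, 0.55, 0.12, 0.01, 0.03]
print(round(tucker_phi(half_a, half_b), 3))
```

Unlike a Pearson correlation, the coefficient does not center the loadings, so it is sensitive to both the pattern and the overall sign of the two vectors.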

major comments (3)
  1. §4.6.3, Table 7, and §5.3: The claim that the Responsiveness dissociation (r_SJ = .53, r_SH = .04, r_HJ = .59) is 'mathematically incompatible with a single-factor model' is not supported at the reported sample size. The bound r_SH ≥ r_SJ × r_HJ ≈ .31 holds for population correlations, but with N=25 and the reported 95% CIs, the lower bounds are r_SJ ≥ .30 and r_HJ ≥ .06, yielding a product as low as .018—well within the r_SH CI of [-.33, +.34]. The human–judge CI for Responsiveness [.06, .86] is extremely wide and barely excludes zero. The paper does not report a formal test (e.g., SEM comparison of single-factor vs. dual-loading models, or a bootstrap of the product r_SJ × r_HJ minus r_SH). Without such a test, the 'mathematical incompatibility' and the language 'demands a dual-loading account' (§5.3) are not justified. This is load-bearing because the dissociation is the basis for the
  2. §4.6.3 and §5.6: The behavioral prompt set (n=20, 4 per factor) is acknowledged as 'relatively small' (§5.6), but the paper does not adequately address the threat this poses to the central null result. The on-target analysis (§4.6.3) shows that restricting to factor-targeted prompts removes even the weak Verbosity signal (on-target mean r = −.03 vs. all-prompts r = .15), meaning the weak convergence that exists is carried by broad aggregation rather than by prompts designed to elicit each factor. If the behavioral prompts do not adequately sample each factor's construct space, the null predictive-validity result could reflect prompt poverty rather than a genuine self-report–behavior decoupling. The paper should either (a) explicitly scope the central claim to 'the 20 prompts tested' rather than 'behavior' broadly, or (b) provide evidence that 4 prompts per factor provide adequate coverage.
  3. §4.6.3, Table 7: The abstract states 'self-report predicted neither [humans nor judges]' but Table 7 shows the mean Instrument–Judge correlation is r = .17 with CI [.06, .29], which excludes zero. The Responsiveness factor alone (r = .53, CI [.30, .72]) drives this. The abstract's blanket claim that self-report does not predict judge ratings is contradicted by the aggregate CI. The paper should reconcile the abstract with the table, either by scoping the claim to human ratings (where no CI excludes zero) or by acknowledging the judge convergence on Responsiveness as a partial exception rather than dismissing it entirely.
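
The bootstrap requested in major comment 1 could look roughly like the sketch below. It resamples models with replacement and recomputes the gap r_SJ × r_HJ − r_SH each time, giving a percentile interval for how far the data sit from the single-factor bound. Everything here is an assumption for illustration: the data are synthetic, the function names are hypothetical, and the paper's released code may take a different approach.

```python
import math
import random
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def gap(s: Sequence[float], h: Sequence[float], j: Sequence[float]) -> float:
    """r_SJ * r_HJ - r_SH; positive values exceed the single-factor bound."""
    return pearson(s, j) * pearson(h, j) - pearson(s, h)


def bootstrap_gap_ci(s, h, j, n_boot: int = 2000, seed: int = 0,
                     level: float = 0.95) -> Tuple[float, float]:
    """Percentile bootstrap CI for the gap, resampling models (rows)."""
    rng = random.Random(seed)
    n = len(s)
    gaps: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        g = gap([s[i] for i in idx], [h[i] for i in idx], [j[i] for i in idx])
        if not math.isnan(g):          # skip degenerate resamples
            gaps.append(g)
    gaps.sort()
    lo = gaps[int((1 - level) / 2 * len(gaps))]
    hi = gaps[int((1 + level) / 2 * len(gaps)) - 1]
    return lo, hi


# Synthetic scores for 25 hypothetical models (NOT the paper's data):
# judges see both a trait and a surface cue, self-report only the surface cue.
rng = random.Random(1)
trait = [rng.gauss(0, 1) for _ in range(25)]
surface = [rng.gauss(0, 1) for _ in range(25)]
s = [u + 0.5 * rng.gauss(0, 1) for u in surface]
h = [t + 0.5 * rng.gauss(0, 1) for t in trait]
j = [t + u + 0.5 * rng.gauss(0, 1) for t, u in zip(trait, surface)]
print(bootstrap_gap_ci(s, h, j))
```

If the resulting interval excluded zero on the real data, the dual-loading reading would have formal support; if it included zero, the single-factor model could not be ruled out at this sample size.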
minor comments (8)
  1. §3.3.3: The observation weighting scheme (weight 1/15 per row, effective N=25) is described clearly, but the paper does not report whether standard errors or CIs in the factor-structure analysis account for this weighting. Clarifying whether the EFA standard errors reflect the effective N or the raw 375 observations would help readers calibrate the factor-stability claims.
  2. Table 6: The flag column uses '†' for factors below r=.65, but the threshold is described as 'preregistered r=.65 threshold' in the text. The table header should clarify whether this refers to mean r or ICC, as both are reported.
  3. §4.5.1: The decision to use forward-keyed Extraversion only (α_fwd = .932 vs. full-scale α = .167) is reasonable, but the paper should note that this subscale is shorter than the standard 8-item E subscale from BFI-44 (which includes reverse-keyed items). The item count should be specified.
  4. Figure 2 caption: 'Self-report and human ratings converge most tightly on Verbosity and Guardedness' — but Table 7 shows Guardedness r = .27 (CI [−.10, +.64]) and Verbosity r = .41 (CI [−.10, +.70]), neither excluding zero. The caption overstates the convergence; consider 'show the largest point estimates for convergence' instead.
  5. §5.3: 'This is a concrete empirical demonstration that LLM-as-judge ratings can look validated against text-based criteria (like self-report) while failing to track the human judgments they are meant to proxy' — this is a single-factor (Responsiveness) finding at N=25 with wide CIs. The generalization to plural 'ratings' and 'criteria' should be tempered.
  6. Appendix B: Several per-model α values are negative (e.g., Llama 4 Maverick Responsiveness α = −0.86). While the text explains this reflects high determinism, a brief note in the table caption would prevent misreading.
  7. §3.2.1: The 13th candidate dimension (Sensitivity to Criticism) merged into Social Alignment is mentioned in a footnote. Consider moving this to the main text for visibility, as it affects the item-generation design.
  8. References: Several arXiv preprints are cited with future dates (e.g., Yang et al., 2026; Gao et al., 2026). Verify these are not placeholder dates.
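
On minor comment 6, it may help to see that Cronbach's α has no lower bound at zero: it turns negative whenever the summed item variances exceed the variance of the total score. The sketch below uses the usual textbook formula with tiny made-up response matrices; the function name `cronbach_alpha` and the data are illustrative assumptions, not the paper's scoring code.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix (sample variances)."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Items that move together: alpha is high.
consistent = [(1, 1), (2, 2), (3, 3)]
# Items that mostly move in opposite directions: totals barely vary, alpha < 0.
opposed = [(1, 3), (2, 2), (3, 1), (2, 3)]

print(cronbach_alpha(consistent), cronbach_alpha(opposed))
```

When a model answers almost deterministically, the total-score variance is close to zero and small fluctuations can push the ratio well past 1, which is one way a strongly negative per-model α can arise.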

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for a careful and constructive review. The referee raises three major comments: (1) the 'mathematical incompatibility' claim for the Responsiveness dissociation is not formally tested and may be unjustified at N=25; (2) the small behavioral prompt set (n=20) threatens the central null result; and (3) the abstract's blanket claim that self-report predicted neither humans nor judges is contradicted by Table 7's judge convergence. We agree with comments (1) and (3) and will revise accordingly. On comment (2), we agree the claim should be scoped but disagree that prompt poverty is the most likely explanation, for reasons we detail below.

read point-by-point responses
  1. Referee: The claim that the Responsiveness dissociation is 'mathematically incompatible with a single-factor model' is not supported at N=25. The bound r_SH ≥ r_SJ × r_HJ holds for population correlations, but with N=25 and the reported CIs, the product of lower bounds can be as low as .018, well within the r_SH CI. No formal test (SEM or bootstrap of the product) is reported. The language 'demands a dual-loading account' is not justified.

    Authors: The referee is correct. The triangle inequality bound r_SH ≥ r_SJ × r_HJ applies to population correlations, and at N=25 the confidence intervals are wide enough that the observed pattern is not formally incompatible with a single-factor model. We did not report a formal SEM comparison or a bootstrap test of the product r_SJ × r_HJ minus r_SH, and without such a test the language 'mathematically incompatible' and 'demands a dual-loading account' overstates what the evidence supports. We will revise the manuscript in three ways: (1) replace 'mathematically incompatible' with language that accurately characterizes the pattern as suggestive of a dual-loading structure rather than a formal test of one; (2) add a bootstrap test of the product r_SJ × r_HJ minus r_SH to quantify how unusual the observed gap is under the null; and (3) soften 'demands a dual-loading account' to 'is consistent with a dual-loading account but does not rule out a single-factor model at this sample size.' We appreciate the referee catching this overstatement. revision: yes

  2. Referee: The behavioral prompt set (n=20, 4 per factor) is acknowledged as 'relatively small' but the paper does not adequately address the threat this poses to the central null result. The on-target analysis shows that restricting to factor-targeted prompts removes even the weak Verbosity signal, meaning the weak convergence is carried by broad aggregation. The null could reflect prompt poverty rather than genuine decoupling. The paper should either scope the central claim to 'the 20 prompts tested' or provide evidence that 4 prompts per factor provide adequate coverage.

    Authors: We agree that the paper should explicitly scope the central null claim to the 20 prompts tested rather than to 'behavior' broadly, and we will revise the abstract, Section 5.3, and Section 5.6 accordingly. However, we partially disagree with the stronger concern that prompt poverty is a plausible alternative explanation for the null, for three reasons. First, the on-target analysis does not merely attenuate the signal—it reverses the sign of Responsiveness (r = -.45, CI excluding zero), which is the opposite of what prompt poverty would predict; inadequate sampling of a construct should produce noise around zero, not a reliable negative correlation. Second, the human-judge agreement on the same 20 prompts is substantial (mean r = .51, four of five factor-level CIs excluding zero), demonstrating that the prompts do elicit discriminable behavioral variation that both rating systems can detect—if the prompts were too impoverished to capture factor-relevant behavior, human-judge agreement should also collapse. Third, the two factors with the most directly observable behavioral signatures (Verbosity and Guardedness) show the largest convergent point estimates, consistent with a gradient of observability rather than a gradient of prompt adequacy. That said, we acknowledge that 4 prompts per factor is insufficient to claim comprehensive construct coverage, and we cannot rule out that a larger, more carefully designed prompt battery would recover stronger convergence. We will add this as an explicit limitation and scope all claims to 'the 20 behavioral prompts tested.' revision: partial

  3. Referee: The abstract states 'self-report predicted neither [humans nor judges]' but Table 7 shows the mean Instrument-Judge correlation is r = .17 with CI [.06, .29], which excludes zero. The Responsiveness factor alone (r = .53, CI [.30, .72]) drives this. The abstract's blanket claim is contradicted by the aggregate CI.

    Authors: The referee is correct. The aggregate Instrument-Judge correlation (r = .17, CI [.06, .29]) does exclude zero, driven primarily by the Responsiveness factor (r = .53). The abstract's statement that 'self-report predicted neither' is inaccurate as applied to judge ratings. We will revise the abstract to state that self-report did not predict human ratings (no factor-level CI excluding zero) but showed partial convergence with LLM-judge ratings, concentrated on Responsiveness. This revision is also consistent with the paper's own framing of the Responsiveness dissociation as 'the exception that is telling'—the abstract should reflect that exception rather than contradicting it. We will ensure the abstract, Section 4.6.3, and the conclusion are all reconciled with Table 7. revision: yes

Circularity Check

0 steps flagged

No significant circularity: central null result is structurally non-circular; behavioral prompts designed post-factor-extraction bias against, not toward, the null finding

full rationale

The paper's central claim — that LLM self-report does not predict behavior — is a null result with independently derived inputs (self-report Likert scores) and outputs (human and LLM-judge ratings of open-ended behavioral samples). There is no self-definitional reduction. The one design choice that could introduce circularity is that behavioral prompts (§3.2.3) and the judge rating instrument were designed using the same five factor definitions extracted from the self-report data. However, this design biases toward finding self-report–behavior convergence, not against it: if the behavioral prompts and rating criteria share construct definitions with the self-report items, we would expect elevated correlations, yet the paper finds near-zero convergence. The on-target analysis (§4.6.3) confirms this: restricting to factor-targeted prompts yields worse convergence (r̄=−.03) than the broad aggregate (r̄=.15). The Responsiveness dissociation claim (r_SH ≥ r_SJ × r_HJ) relies on a standard single-factor correlation bound, not a self-citation, and its concern is statistical validity at N=25 (point estimates treated as population parameters), not circularity. No load-bearing self-citations were identified. Score of 1 reflects the minor shared-instrument design choice that is non-circular in effect.

Axiom & Free-Parameter Ledger

4 free parameters · 4 axioms · 2 invented entities

The instrument introduces four design choices that function as free parameters (k=5, item thresholds, prompt count, judge composition), all justified but not parameter-free. The four domain assumptions are the load-bearing axioms: the logprob–sampling equivalence (validated for 7/25 models), human rater reliability (modest ICCs), seed-dimension coverage (literature-based but not exhaustive), and factor-analysis adequacy at N=25 (defended via robustness checks but acknowledged as unconventional). No invented entities are postulated without falsifiable handles.

free parameters (4)
  • Number of factors (k=5) = 5
    Parallel analysis suggested 19 factors; the authors overrode this and selected k=5 based on scree elbow, interpretability, and Tucker congruence (§4.2.1, Deviations from Preregistration). This is a post-hoc choice from the data, though validated on a held-out half.
  • Item retention thresholds = primary loading ≥.40, cross-loading <.30
    These thresholds determined which 100 of 240 items were retained (§3.4.1). They are conventional psychometric thresholds, not fitted to this dataset, but they shape the final instrument.
  • Behavioral prompt set (n=20) = 20 prompts (4 per factor)
    The 20 behavioral prompts were designed after factor extraction to target each factor's construct space (§3.2.3). The choice of 4 prompts per factor is not justified by power analysis and limits the predictive-validity test.
  • Judge ensemble composition = Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
    Three judges selected as 'state-of-the-art' (§3.4.5). The choice is reasonable but not parameter-free; different judge compositions could yield different agreement patterns.
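
The item-retention rule in the ledger above (primary loading ≥ .40, cross-loading < .30) can be illustrated with a short filter. This is a sketch under stated assumptions: it applies the thresholds to absolute loadings and treats the largest absolute loading as the primary one, which the summary does not spell out. The item names and loading values are hypothetical.

```python
from typing import Dict, List


def retained_items(loadings: Dict[str, List[float]],
                   primary_min: float = 0.40,
                   cross_max: float = 0.30) -> Dict[str, int]:
    """Map each retained item to the index of its primary factor."""
    kept: Dict[str, int] = {}
    for item, row in loadings.items():
        abs_row = [abs(v) for v in row]
        # Assumption: the primary factor is the one with the largest |loading|.
        primary = max(range(len(row)), key=abs_row.__getitem__)
        cross = max((a for i, a in enumerate(abs_row) if i != primary),
                    default=0.0)
        if abs_row[primary] >= primary_min and cross < cross_max:
            kept[item] = primary
    return kept


# Hypothetical loadings on three factors.
example_loadings = {
    "item_a": [0.62, 0.10, 0.05],    # clean primary loading: kept
    "item_b": [0.45, 0.35, 0.00],    # cross-loading too high: dropped
    "item_c": [0.38, 0.02, 0.01],    # primary loading too low: dropped
    "item_d": [-0.55, 0.12, 0.20],   # negative but strong: kept
}
print(retained_items(example_loadings))
```

Because both cutoffs are fixed in advance rather than tuned, the main degree of freedom is which rotation produced the loading matrix being filtered.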
axioms (4)
  • domain assumption LLM responses to Likert items at temperature 1.0 across 30 runs approximate the model's probability-weighted expected score.
    Validated for 7 models with log-probability access (mean r=.999 between repeated-sampling and logprob scores, §4.5.3), but assumed for the remaining 18 models.
  • domain assumption Human raters on Prolific can reliably rate LLM behavioral outputs on abstract constructs like Responsiveness and Boldness.
    Item-level ICCs are modest (.18–.43, §4.6.1), below conventional thresholds for individual-decision reliability. The paper argues model-level aggregation absorbs this noise, but the axiom that human ratings carry real signal at N=25 rests on the human–judge agreement (mean r=.51).
  • domain assumption The 12 candidate dimensions used for item generation adequately sample the space of LLM behavioral variation.
    The 12 seed dimensions were identified by 'reasoning from documented LLM behaviors in the alignment, safety, and evaluation literatures' (§3.2.1). If important behavioral dimensions were omitted, the factor structure may be incomplete.
  • domain assumption Factor analysis at model-level N=25 with 240 items yields a trustworthy covariance structure.
    The paper acknowledges this is 'unconventional by the subject-to-item ratios developed for human psychometrics' (§3.3.3) and relies on four complementary checks. The model-level robustness check (Appendix K) is the strongest defense, but N=25 remains small for factor generalization to new models.
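
The first axiom above compares two ways of scoring a Likert item: averaging repeated samples, or taking the probability-weighted expectation from option log-probabilities. A minimal sketch of both follows, assuming a 5-point scale and a dictionary of per-option log-probabilities. Both function names and the log-probability values are hypothetical and do not correspond to any specific provider API.

```python
import math
import random
from typing import Dict


def expected_score(logprobs: Dict[int, float]) -> float:
    """Probability-weighted mean over options, renormalizing the logprobs."""
    m = max(logprobs.values())                       # subtract max for stability
    weights = {k: math.exp(lp - m) for k, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(k * w for k, w in weights.items()) / total


def sampled_mean(logprobs: Dict[int, float], n_runs: int = 30,
                 seed: int = 0) -> float:
    """Mean of n_runs draws from the same distribution (temperature 1.0)."""
    rng = random.Random(seed)
    options = list(logprobs)
    m = max(logprobs.values())
    weights = [math.exp(logprobs[k] - m) for k in options]
    draws = rng.choices(options, weights=weights, k=n_runs)
    return sum(draws) / n_runs


# Hypothetical log-probabilities for the five response options on one item.
item_logprobs = {1: -4.0, 2: -2.5, 3: -1.2, 4: -0.6, 5: -2.0}
print(expected_score(item_logprobs), sampled_mean(item_logprobs))
```

With 30 draws per item the sampled mean is a noisy estimate of the expectation, so agreement between the two is something to check empirically rather than assume, as the ledger notes for the 18 models without log-probability access.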
invented entities (2)
  • Five LLM-native self-report factors (Responsiveness, Deference, Boldness, Guardedness, Verbosity) independent evidence
    purpose: To organize LLM self-description along behaviorally grounded dimensions
    The factors are derived from EFA on a purpose-built item pool and validated via split-half replication (Tucker ϕ≥.957), model-level robustness (ϕ≥.990), and cross-run stability (r≥.965). They make a falsifiable prediction: that the same structure should emerge in an independent sample of LLMs.
  • Textual-surface bias (shared variance between LLM self-report and LLM judges) independent evidence
    purpose: To explain the Responsiveness dissociation (self-report–judge r=.53, self-report–human r=.04)
    The bias is inferred from a mathematical inconsistency with a single-factor model (r_SJ × r_HJ ≈ .31 as a lower bound on r_SH, observed r_SH = .04). It makes a falsifiable prediction: that controlling for textual-surface features (formatting, enthusiasm) should reduce the self-report–judge correlation. The paper does not test this directly but proposes it as future work (§5.7).
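
One simple way to run the proposed check would be a first-order partial correlation between self-report and judge scores, controlling for a measured surface feature. The sketch below uses the standard partial-correlation formula; the two correlations with the surface feature are invented for illustration, and only r = .53 comes from the summary above.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y after partialling out z from both."""
    den = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if den == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / den


r_self_judge = 0.53      # reported self-report vs. judge correlation
r_self_surface = 0.60    # hypothetical: self-report vs. surface-feature score
r_judge_surface = 0.70   # hypothetical: judge rating vs. surface-feature score

print(round(partial_corr(r_self_judge, r_self_surface, r_judge_surface), 3))
```

Under the textual-surface account, the partial correlation should fall well below .53 once a good surface-feature measure is controlled; if it stayed near .53, the shared variance would need a different explanation.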

pith-pipeline@v1.1.0-glm · 39311 in / 4115 out tokens · 256699 ms · 2026-07-04T19:31:10.922162+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 10 internal anchors

  1. [1]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a labora...

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  3. [3]

    Evaluating personality traits in large language models: Insights from psychological questionnaires

    P. Bhandari, U. Naseem, A. Datta, N. Fay, and M. Nasim. Evaluating personality traits in large language models: Insights from psychological questionnaires. In Companion Proceedings of the ACM Web Conference 2025, 2025. URL https://arxiv.org/abs/2502.05248

  4. [4]

    Art or artifice? Large language models and the false promise of creativity

    Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? Large language models and the false promise of creativity. arXiv preprint arXiv:2309.14556, 2024. URL https://arxiv.org/abs/2309.14556

  5. [5]

    OR-Bench: An over-refusal benchmark for large language models

    Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947, 2025. URL https://arxiv.org/abs/2405.20947

  6. [6]

    Do personality tests generalize to large language models?

    Florian E. Dorner, Tom Sühr, Samira Samadi, and Augustin Kelava. Do personality tests generalize to large language models? In Socially Responsible Language Modelling Research (SoLaR) Workshop at NeurIPS, 2023

  7. [7]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2025. URL https://arxiv.org/abs/2404.04475

  8. [8]

    On the creativity of large language models

    Giorgio Franceschelli and Mirco Musolesi. On the creativity of large language models. AI & Society, 40(5):3785–3795, 2024. URL https://link.springer.com/article/10.1007/s00146-024-02127-3

  9. [9]

    D. C. Funder. On the accuracy of personality judgment: A realistic approach. Psychological Review, 102(4):652–670, 1995

  10. [10]

    Evaluating and mitigating LLM-as-a-judge bias in communication systems

    Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, and Qian Wang. Evaluating and mitigating LLM-as-a-judge bias in communication systems, 2026. URL https://arxiv.org/abs/2510.12462

  11. [11]

    Self-assessment tests are unreliable measures of LLM personality

    Akshat Gupta, Xiaoyang Song, and Gopala Anumanchipalli. Self-assessment tests are unreliable measures of LLM personality, 2024. URL https://arxiv.org/abs/2309.08163

  12. [12]

    Computing inter-rater reliability for observational data: an overview and tutorial

    Kevin A. Hallgren. Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1):23, 2012

  13. [13]

    T. F. Heston and J. Gillette. Large language models demonstrate distinct personality profiles. Cureus, 17(5):e84706, 2025. URL https://doi.org/10.7759/cureus.84706

  14. [14]

    PersonalityChat: Conversation personalization through personality

    H. Jiang, X. Zhang, X. Cao, J. Kabbara, and D. Roy. PersonalityChat: Conversation personalization through personality. In NeurIPS 2023, 2023

  15. [15]

    O. P. John, L. P. Naumann, and C. J. Soto. Paradigm shift to the integrative Big Five trait taxonomy. In O. P. John, R. W. Robins, and L. A. Pervin, editors, Handbook of personality: Theory and research, pages 114–158. 3rd edition, 1999

  16. [16]

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Samuel R. Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna K...

  17. [17]

    Content analysis: An introduction to its methodology

    Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage publications, 2018

  18. [18]

    TRAIT: A psychometric tool for evaluating LLM personality traits

    S. Lee et al. TRAIT: A psychometric tool for evaluating LLM personality traits. In NAACL Findings 2025, 2025

  19. [19]

    Evaluating large language models with psychometrics

    H. Li, J.-t. Huang, H. Wang, H. Cheng, W. Zhang, X. Zou, and L. Sun. Evaluating large language models with psychometrics. arXiv preprint arXiv:2406.17675, 2024. URL https://arxiv.org/abs/2406.17675

  20. [20]

    Decoding LLM personality measurement: Forced-choice vs. Likert

    Xiaoyu Li, Haoran Shi, Zengyi Yu, Yukun Tu, and Chanjin Zheng. Decoding LLM personality measurement: Forced-choice vs. Likert. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 9234–9247, Vienna, Austria, July 2025. Association for Computati...

  21. [21]

    U. Lorenzo-Seva and J. M. F. ten Berge. Tucker's congruence coefficient as a meaningful index of factor similarity. Methodology, 2(2):57–64, 2006

  22. [22]

    J. Maharjan, R. Jin, J. Zhu, and D. Kenne. Psychometric evaluation of large language model embeddings for personality trait prediction. Journal of Medical Internet Research, 27:e75347, 2025. URL https://doi.org/10.2196/75347

  23. [23]

    H. W. Marsh, A. J. Morin, P. D. Parker, and G. Kaur. Exploratory structural equation modeling: An integration of the best features of exploratory and confirmatory factor analysis. Annual Review of Clinical Psychology, 10:85–110, 2014

  24. [24]

    J. Musek. A general factor of personality: Evidence for the Big One in the five-factor model. Journal of Research in Personality, 41(6):1213–1233, 2007

  25. [25]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  26. [26]

    S. Palan and C. Schitter. Prolific.ac – A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17:22–27, 2018

  27. [27]

    E. Peer, D. Rothschild, A. Gordon, Z. Evernden, and E. Damer. Data quality of platforms and panels for online behavioral research. Behavior Research Methods, 54(4):1643–1662, 2022

  28. [28]

    Sanne Peereboom, Inga Schwabe, and Bennett Kleinberg. Cognitive phantoms in large language models through the lens of latent variables. Computers in Human Behavior: Artificial Humans, 4: 0 100161, May 2025. ISSN 2949-8821. doi:10.1016/j.chbah.2025.100161. URL http://dx.doi.org/10.1016/j.chbah.2025.100161

  29. [29]

    M. Pellert, C. M. Lechner, C. Wagner, B. Rammstedt, and M. Strohmaier. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science, 19(5):808–826, 2024. URL https://doi.org/10.1177/17456916231214460

  30. [30]

    Ethan Perez et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022. URL https://arxiv.org/abs/2212.09251

  31. [31]

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2024. URL https://arxiv.org/abs/2308.01263

  32. [32]

    Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076, 2023. URL https://arxiv.org/abs/2310.10076

  33. [33]

    A. Salecha, M. E. Ireland, S. Subramanya, J. Sedoc, L. H. Ungar, and J. C. Eichstaedt. Large language models display human-like social desirability biases in Big Five personality surveys. PNAS Nexus, 3(12):pgae533, 2024. URL https://doi.org/10.1093/pnasnexus/pgae533

  34. [34]

    G. Serapio-García, M. Safdari, C. Crepy, L. Sun, S. Fitz, P. Romero, and M. Matarić. Personality traits in large language models. Nature Machine Intelligence, 2025

  35. [35]

    Mrinank Sharma et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023. URL https://arxiv.org/abs/2310.13548

  36. [36]

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge, 2025. URL https://arxiv.org/abs/2406.07791

  37. [37]

    Patrick E Shrout and Joseph L Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86(2):420, 1979

  38. [38]

    R. Suzuki and T. Arita. An evolutionary model of personality traits related to cooperative behavior using a large language model. Scientific Reports, 14:5989, 2024. URL https://doi.org/10.1038/s41598-024-55903-y

  39. [39]

    S. Vazire. Who knows what about a person? The self–other knowledge asymmetry (SOKA) model. Journal of Personality and Social Psychology, 98(2):281–300, 2010

  40. [40]

    Y. Wang, J. Zhao, D. S. Ones, L. He, and X. Xu. Evaluating the ability of large language models to emulate personality. Scientific Reports, 15:519, 2025. URL https://doi.org/10.1038/s41598-024-84109-5

  41. [41]

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in LLM-as-a-judge, 2025. URL https://arxiv.org/abs/2410.21819

  42. [42]

    Z. Wen, S. Yang, Z. Cao, Q. Sun, J. Yang, and Y. Liu. Self-assessment, exhibition, and recognition: A review of personality in large language models. arXiv preprint arXiv:2406.17624, 2024. URL https://arxiv.org/abs/2406.17624

  43. [43]

    W. Xie, S. Ma, Z. Wang, et al. AIPsychoBench: Understanding the psychometric differences between LLMs and humans. arXiv preprint arXiv:2509.16530, 2025. URL https://arxiv.org/abs/2509.16530

  44. [44]

    Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, and Shao-Hua Sun. On calibration of large language models: From response to capability. arXiv preprint arXiv:2602.13540, 2026. URL https://arxiv.org/abs/2602.13540

  45. [45]

    H. Ye, J. Jin, Y. Xie, X. Zhang, and G. Song. Large language model psychometrics: A systematic review of evaluation, validation, and enhancement. arXiv preprint arXiv:2505.08245, 2026. URL https://arxiv.org/abs/2505.08245

  46. [46]

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS 2023, 2023

  47. [47]

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2020. URL https://arxiv.org/abs/1909.08593