Human Psychometric Questionnaires Mischaracterize LLM Behavior
Pith reviewed 2026-05-18 17:39 UTC · model grok-4.3
The pith
LLM responses to human psychometric questionnaires substantially differ from their generation probabilities on real-world user queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For eight open-source LLMs, self-reported Likert scores from established questionnaires such as PVQ-40, PVQ-21, BFI-44, and BFI-10 differ substantially from generation probability scores of value- or personality-laden responses to real-world user queries. This difference supplies evidence that LLMs' answers to questionnaires reflect desired behavior rather than stable psychological constructs. The results also indicate that established questionnaires risk exaggerating demographic biases and that generation-based profiling offers a more reliable route to LLM psychometrics.
What carries the argument
Direct comparison of questionnaire-based self-reports against generation probability scores for laden responses to user queries; the comparison reveals the mismatch between the two profiling methods.
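A comparison of this kind could be sketched as below. This is a minimal illustration under stated assumptions, not the paper's code: the two profile vectors, their dimension order, and every number are hypothetical, and the summary does not say which similarity metrics the authors used. Pearson correlation ignores scale and offset, while cosine similarity does not, so the two can disagree when the profiles live on different scales (Likert points versus probabilities).

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profile vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cosine(x, y):
    """Cosine similarity between two equal-length profile vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Hypothetical profiles over the same five dimensions (illustrative values only).
questionnaire_profile = [4.5, 4.0, 2.0, 3.5, 5.0]    # mean Likert score per dimension
generation_profile = [0.20, 0.25, 0.30, 0.15, 0.10]  # generation-based score per dimension

print("pearson:", round(pearson(questionnaire_profile, generation_profile), 3))
print("cosine: ", round(cosine(questionnaire_profile, generation_profile), 3))
```

A low or negative Pearson value on real data would correspond to the mismatch the paper describes; both functions fail with a division by zero if either profile has no variation (Pearson) or is all zeros (cosine).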
If this is right
- Established questionnaires risk exaggerating the demographic biases of LLMs.
- Psychological profiles derived from questionnaires should be interpreted with caution.
- Generation-based profiling is a more reliable approach to LLM psychometrics.
- Prior claims of consistent psychological dispositions in LLMs are challenged by the observed mismatch.
Where Pith is reading between the lines
- Future LLM evaluation could shift emphasis from direct self-report surveys to observing behavior in simulated user conversations.
- The divergence may indicate that training teaches LLMs to answer questionnaires in the expected way without making their behavior consistent across contexts.
- Alignment researchers could apply similar generation-based checks to test whether safety training affects questionnaire answers more than actual output distributions.
Load-bearing premise
Generation probability scores of value- or personality-laden responses to real-world user queries accurately capture the LLMs' psychological characteristics expressed during interactions with users.
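One way such a score might be computed is sketched below, assuming the per-token log-probabilities of each candidate response (conditioned on the user query) have already been obtained from the model. The function names, the length normalization, and the softmax over candidates are assumptions made for illustration; the summary does not specify the paper's exact scoring procedure, and the token values are invented.

```python
import math


def sequence_logprob(token_logprobs, length_normalize=True):
    """Log-probability of a response from its per-token log-probabilities.

    With length_normalize=True the mean is returned, which avoids
    penalizing longer responses (an assumed design choice).
    """
    if not token_logprobs:
        raise ValueError("response has no tokens")
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalize else total


def normalize_scores(logprob_by_label):
    """Softmax over candidate responses, giving relative scores that sum to 1."""
    peak = max(logprob_by_label.values())  # subtract the max for numerical stability
    exps = {label: math.exp(lp - peak) for label, lp in logprob_by_label.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical per-token log-probs for two value-laden responses to one query.
candidates = {
    "benevolence": [-0.2, -0.5, -0.3],
    "achievement": [-0.9, -1.1, -0.7, -0.8],
}
scores = normalize_scores({label: sequence_logprob(lps) for label, lps in candidates.items()})
print(scores)
```

Averaging these relative scores over many queries would give one number per value or trait dimension. Whether that number reflects a psychological characteristic is exactly the premise stated above, which the code does not test.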
What would settle it
A new test that finds a strong positive correlation between questionnaire scores and generation probabilities across a broad set of LLMs and query collections would undermine the claim that the two profiles are substantially different.
Original abstract
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.
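Within-construct item consistency is commonly summarized with Cronbach's alpha. The sketch below shows the standard formula, assuming each row holds one observation (for example one prompt variant or one sampled run) and each column one item of the same construct. The abstract does not say which consistency statistic the authors computed, so treating it as alpha is an assumption, and the data are invented.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of observations, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical Likert answers: 4 observations x 3 items of one construct.
likert_rows = [
    [5, 4, 5],
    [4, 4, 4],
    [2, 3, 2],
    [3, 3, 4],
]
print("alpha:", round(cronbach_alpha(likert_rows), 3))
```

Items that rise and fall together push alpha toward 1; items that vary independently push it toward 0 or below. The abstract's claim is that the former pattern appears in questionnaire answers but not in generation probabilities.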
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares psychological profiles for eight open-source LLMs obtained from Likert-scale responses to established human questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) against profiles derived from generation probability scores of value- or personality-laden responses to real-world user queries. It reports that the two profile types are substantially different, concluding that questionnaire responses reflect desired behavior rather than stable psychological constructs, thereby mischaracterizing LLM psychology, challenging prior claims of consistent dispositions, and risking exaggeration of demographic biases; generation-based profiling is positioned as more reliable.
Significance. If the central empirical discrepancy holds after addressing methodological gaps, the work would be significant for LLM evaluation and AI psychology research. It supplies a direct test of whether human-designed instruments capture interaction-relevant traits and offers an alternative generation-based approach. The use of multiple questionnaires and models provides breadth, though the result's impact hinges on validating the generation scores as a faithful proxy for stable dispositions.
major comments (2)
- [§3] §3 (Methods, generation probability scoring): The central claim interprets the discrepancy as evidence that questionnaires elicit 'desired behavior' rather than stable traits, but this requires that generation probability scores validly measure psychological characteristics expressed in user interactions. No independent validation is reported (e.g., correlation with human ratings of outputs, test-retest stability across query sets, or predictive validity for downstream behaviors), leaving the conclusion equally consistent with the generation method being unreliable or artifact-prone.
- [Results] Results section (profile comparison): The abstract and main text state that the two profiles 'turn out to be substantially different,' yet no quantitative metrics (correlation, cosine similarity, or statistical tests with sample sizes and controls) or tables reporting these values are described. Without such evidence, the magnitude and reliability of the difference cannot be assessed and the claim that questionnaires mischaracterize LLM psychology remains under-supported.
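The stability check suggested in the first major comment could be approximated by a split-half analysis over queries, sketched below. Everything here is an assumption for illustration: the input layout (one list of per-query scores per dimension), the number of random splits, and the use of Pearson correlation as the agreement measure. The scores are invented.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )


def split_half_stability(scores_by_dim, n_splits=200, seed=0):
    """Mean correlation between profiles built from two random halves of the queries.

    scores_by_dim maps each dimension to its per-query scores, with the
    same query order in every list.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    dims = sorted(scores_by_dim)
    indices = list(range(len(scores_by_dim[dims[0]])))
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        half_a, half_b = indices[: len(indices) // 2], indices[len(indices) // 2 :]
        profile_a = [sum(scores_by_dim[d][i] for i in half_a) / len(half_a) for d in dims]
        profile_b = [sum(scores_by_dim[d][i] for i in half_b) / len(half_b) for d in dims]
        correlations.append(pearson(profile_a, profile_b))
    return sum(correlations) / len(correlations)


# Hypothetical per-query generation scores for three dimensions, six queries each.
example = {
    "benevolence": [0.42, 0.38, 0.45, 0.40, 0.37, 0.44],
    "achievement": [0.30, 0.35, 0.28, 0.33, 0.36, 0.29],
    "security": [0.28, 0.27, 0.27, 0.27, 0.27, 0.27],
}
print("split-half stability:", round(split_half_stability(example), 3))
```

A value near 1 would suggest the generation-based profile does not depend much on which queries were sampled; a low value would support the referee's concern that the method is artifact-prone.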
minor comments (2)
- [Abstract] Abstract: The phrase 'substantially different' would be clearer if accompanied by a brief indication of the metric or effect size used to quantify the difference.
- [Figures] Figure captions: Ensure all figures comparing profiles include axis labels, legend details, and any error information for reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive review of our manuscript. We appreciate the opportunity to address the major comments and have revised the paper to strengthen the presentation of our methods and results.
- Referee: [§3] §3 (Methods, generation probability scoring): The central claim interprets the discrepancy as evidence that questionnaires elicit 'desired behavior' rather than stable traits, but this requires that generation probability scores validly measure psychological characteristics expressed in user interactions. No independent validation is reported (e.g., correlation with human ratings of outputs, test-retest stability across query sets, or predictive validity for downstream behaviors), leaving the conclusion equally consistent with the generation method being unreliable or artifact-prone.
  Authors: We thank the referee for this important methodological point. The generation probability scores are obtained by computing the model's log-probabilities for producing value- or personality-aligned continuations to real-world user queries drawn from public interaction logs; this directly samples from the distribution the model uses during actual user interactions. While the original submission did not include external validation experiments (such as human ratings of generated outputs or test-retest checks), we maintain that the method provides a more ecologically valid proxy for expressed behavior than forced Likert responses. In the revision we have added a dedicated paragraph in the Methods section justifying the approach, explicitly stating its assumptions, and acknowledging the absence of independent validation as a limitation that future work should address.
  Revision: partial
- Referee: [Results] Results section (profile comparison): The abstract and main text state that the two profiles 'turn out to be substantially different,' yet no quantitative metrics (correlation, cosine similarity, or statistical tests with sample sizes and controls) or tables reporting these values are described. Without such evidence, the magnitude and reliability of the difference cannot be assessed and the claim that questionnaires mischaracterize LLM psychology remains under-supported.
  Authors: We agree that quantitative metrics are required to support the claim of substantial differences. The original manuscript presented the profile comparisons primarily through visualizations and qualitative description. In the revised Results section we now include a table reporting Pearson correlations, cosine similarities, and results of paired statistical tests (with sample sizes, degrees of freedom, and multiple-comparison corrections) between the questionnaire-derived and generation-based profiles for each of the eight models and four questionnaires. These metrics confirm low correlations and statistically significant differences, providing the requested quantitative grounding for the conclusion.
  Revision: yes
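With eight models and four questionnaires, the comparison involves up to 32 tests, so some multiple-comparison correction is needed. The rebuttal does not name the correction used; the sketch below shows the Holm-Bonferroni step-down procedure as one common choice, with invented p-values.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    The i-th smallest p-value (0-based rank i) is multiplied by (m - i),
    then a running maximum keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical raw p-values from paired tests on a few model/questionnaire pairs.
raw = [0.001, 0.020, 0.030, 0.400]
for p, q in zip(raw, holm_adjust(raw)):
    print(f"raw={p:.3f}  holm={q:.3f}  significant_at_0.05={q < 0.05}")
```

Holm controls the family-wise error rate like plain Bonferroni but is never less powerful; a false-discovery-rate method such as Benjamini-Hochberg would be a reasonable alternative if the authors prefer a less conservative criterion.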
Circularity Check
Empirical comparison of questionnaire and generation profiles shows no reduction to fitted inputs or self-referential definitions
The paper performs a direct empirical comparison between Likert-scale responses from established human psychometric questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores derived from value- or personality-laden responses to real-world user queries across eight open-source LLMs. The central observation—that the resulting profiles differ substantially—is presented as an empirical finding rather than a mathematical derivation. No equations, fitted parameters, or predictions are involved that reduce outputs to inputs by construction. Citations to prior work on LLM psychological dispositions are used to contextualize the challenge but do not serve as load-bearing uniqueness theorems or self-citation chains that justify the core claim. The analysis remains self-contained through data collection and profile comparison without circular redefinition or smuggling of ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generation probability scores of value- or personality-laden responses accurately reflect LLMs' psychological characteristics in real interactions
Forward citations
Cited by 1 Pith paper
- Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstre...