arxiv: 2509.10078 · v4 · pith:AO7CP7YRnew · submitted 2025-09-12 · 💻 cs.CL · cs.AI

Human Psychometric Questionnaires Mischaracterize LLM Behavior

Pith reviewed 2026-05-18 17:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM psychometrics · questionnaire validity · generation behavior · personality profiling · value assessment · AI psychology · model bias · human-AI interaction

The pith

LLM self-reports on human psychometric questionnaires differ substantially from the probabilities the same models assign to value- or personality-laden responses to real-world user queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways of profiling the psychology of eight open-source LLMs. One way uses standard human questionnaires that ask the model to rate statements about values or personality traits on a Likert scale. The other way measures how likely the model is to generate value-laden or personality-laden answers when responding to actual user queries drawn from real interactions. The two resulting profiles turn out to be substantially different. A sympathetic reader would care because many existing studies describe LLMs as having stable personalities or values based solely on questionnaire answers; if those answers do not match how the models actually generate text with users, the descriptions are unreliable. The work therefore challenges earlier claims that LLMs possess consistent psychological dispositions.

Core claim

For eight open-source LLMs, self-reported Likert scores from established questionnaires such as PVQ-40, PVQ-21, BFI-44, and BFI-10 differ substantially from generation probability scores of value- or personality-laden responses to real-world user queries. This difference supplies evidence that LLMs' answers to questionnaires reflect desired behavior rather than stable psychological constructs. The results also indicate that established questionnaires risk exaggerating demographic biases and that generation-based profiling offers a more reliable route to LLM psychometrics.

What carries the argument

Direct comparison of questionnaire-based self-reports against generation probability scores for laden responses to user queries; the comparison reveals the mismatch between the two profiling methods.

If this is right

  • Established questionnaires risk exaggerating the demographic biases of LLMs.
  • Psychological profiles derived from questionnaires should be interpreted with caution.
  • Generation-based profiling is a more reliable approach to LLM psychometrics.
  • Prior claims of consistent psychological dispositions in LLMs are challenged by the observed mismatch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future LLM evaluation could shift emphasis from direct self-report surveys to observing behavior in simulated user conversations.
  • The divergence may show that training encourages LLMs to answer questionnaires in the expected way without producing matching behavior in other contexts.
  • Alignment researchers could apply similar generation-based checks to test whether safety training affects questionnaire answers more than actual output distributions.

Load-bearing premise

Generation probability scores of value- or personality-laden responses to real-world user queries accurately capture the LLMs' psychological characteristics expressed during interactions with users.
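
One way to make this premise concrete is sketched below in Python. It is a minimal illustration, not the paper's implementation: the function names, the length normalization, and the toy log-probabilities are all assumptions introduced here. The sketch sums the per-token log-probabilities of a candidate response, optionally normalizes by length, and averages the result per construct.

```python
from statistics import mean


def sequence_logprob(token_logprobs, length_normalize=True):
    """Score one candidate response from its per-token log-probabilities.

    Length normalization is an assumption made for this sketch; the
    source does not say how (or whether) scores are normalized.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalize else total


def construct_scores(items):
    """Average response scores per construct.

    `items` is a list of (construct_name, token_logprobs) pairs, where
    each pair is one value- or personality-laden response to a query.
    """
    grouped = {}
    for construct, token_logprobs in items:
        grouped.setdefault(construct, []).append(sequence_logprob(token_logprobs))
    return {construct: mean(values) for construct, values in grouped.items()}


# Hypothetical log-probabilities, invented purely for illustration.
items = [
    ("benevolence", [-0.5, -1.5]),
    ("benevolence", [-1.0, -1.0, -1.0]),
    ("power", [-2.0, -4.0]),
]
scores = construct_scores(items)
print(scores)
```

Whether a score built this way tracks a psychological characteristic at all is exactly what the premise assumes; the sketch only shows how such a score could be computed.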

What would settle it

A new test that finds strong positive correlation between questionnaire scores and generation probabilities across a broad set of LLMs and query collections would undermine the claim that the two profiles are substantially different.
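
The settling test described above amounts to a correlation between two per-construct profiles. A minimal Python sketch follows, assuming each profile is a list of scores given in the same construct order. The function name and the numbers are placeholders invented for illustration, not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-construct profiles for one model (same construct order).
questionnaire_profile = [5.2, 4.8, 3.1, 2.4]      # mean Likert ratings
generation_profile = [-1.9, -2.6, -2.1, -2.0]     # mean log-probability scores

r = pearson(questionnaire_profile, generation_profile)
print(f"Pearson r = {r:.3f}")
```

Because Pearson correlation is invariant to shifting and rescaling either list, the Likert ratings and the log-probability scores do not need to be put on a common scale first.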

Figures

Figures reproduced from arXiv: 2509.10078 by Dongmin Choi, Eun-Ju Lee, Jongwook Han, Woojung Song, Yohan Jo, Yoonah Park.

Figure 1. Prompt template for Value Portrait items
Figure 2. Bar chart showing the average confidence interval width across 10 models at the value dimension level.
Figure 3. Hero-villain score differences across constructs. Points to the right of zero indicate heroes score higher,
Figure 4. Average absolute differences in value scores across demographic contrasts (male vs. female, religious
Figure 5. Value Portrait prompt version 1. Value Portrait consists of human-LLM conversations (ShareGPT and LMSYS (Zheng et al., 2024)) and human-human advisory contexts (Reddit (Lourie et al., 2021) and Dear Abby archives). Items from human-LLM conversations and human-human advisory contexts require different prompts. Items sourced from human-LLM conversations lack titles and consist of direct user queries to LLM…
Figure 6. Value Portrait prompt version 2. Prompt Template for Value Portrait items from human-LLM conversations Now I will briefly describe a message and response. Please read them and tell me how similar this response is to your own thoughts. Please answer, even if you are not completely sure of your response. Message: {text} Response: {content} IMPORTANT: Your response must contain ONLY ONE of these exact phras…
Figure 7. Prompt template for Value Portrait items
Figure 9. Prompt template for BFI items. Prompt Template for Established Questionnaires Item-Construct Recognition You are an expert in psychology. Question: “{question_text}” Available {construct_type}: {all_items} Which ONE of these {construct_type} does this question primarily measure? Choose the single best match. Respond with only the exact name from the list above
Figure 10. Prompt template for item-construct recog
Figure 11. Prompt template for item-construct recog
Figure 12. System prompts used to assign demographic
Figure 13. Bar chart showing the average confidence interval width across 10 models at the trait dimension level.
Original abstract

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper compares psychological profiles for eight open-source LLMs obtained from Likert-scale responses to established human questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) against profiles derived from generation probability scores of value- or personality-laden responses to real-world user queries. It reports that the two profile types are substantially different, concluding that questionnaire responses reflect desired behavior rather than stable psychological constructs, thereby mischaracterizing LLM psychology, challenging prior claims of consistent dispositions, and risking exaggeration of demographic biases; generation-based profiling is positioned as more reliable.

Significance. If the central empirical discrepancy holds after addressing methodological gaps, the work would be significant for LLM evaluation and AI psychology research. It supplies a direct test of whether human-designed instruments capture interaction-relevant traits and offers an alternative generation-based approach. The use of multiple questionnaires and models provides breadth, though the result's impact hinges on validating the generation scores as a faithful proxy for stable dispositions.

major comments (2)
  1. [§3] §3 (Methods, generation probability scoring): The central claim interprets the discrepancy as evidence that questionnaires elicit 'desired behavior' rather than stable traits, but this requires that generation probability scores validly measure psychological characteristics expressed in user interactions. No independent validation is reported (e.g., correlation with human ratings of outputs, test-retest stability across query sets, or predictive validity for downstream behaviors), leaving the conclusion equally consistent with the generation method being unreliable or artifact-prone.
  2. [Results] Results section (profile comparison): The abstract and main text state that the two profiles 'turn out to be substantially different,' yet no quantitative metrics (correlation, cosine similarity, or statistical tests with sample sizes and controls) or tables reporting these values are described. Without such evidence, the magnitude and reliability of the difference cannot be assessed and the claim that questionnaires mischaracterize LLM psychology remains under-supported.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'substantially different' would be clearer if accompanied by a brief indication of the metric or effect size used to quantify the difference.
  2. [Figures] Figure captions: Ensure all figures comparing profiles include axis labels, legend details, and any error information for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review of our manuscript. We appreciate the opportunity to address the major comments and have revised the paper to strengthen the presentation of our methods and results.

  1. Referee: [§3] §3 (Methods, generation probability scoring): The central claim interprets the discrepancy as evidence that questionnaires elicit 'desired behavior' rather than stable traits, but this requires that generation probability scores validly measure psychological characteristics expressed in user interactions. No independent validation is reported (e.g., correlation with human ratings of outputs, test-retest stability across query sets, or predictive validity for downstream behaviors), leaving the conclusion equally consistent with the generation method being unreliable or artifact-prone.

    Authors: We thank the referee for this important methodological point. The generation probability scores are obtained by computing the model's log-probabilities for producing value- or personality-aligned continuations to real-world user queries drawn from public interaction logs; this directly samples from the distribution the model uses during actual user interactions. While the original submission did not include external validation experiments (such as human ratings of generated outputs or test-retest checks), we maintain that the method provides a more ecologically valid proxy for expressed behavior than forced Likert responses. In the revision we have added a dedicated paragraph in the Methods section justifying the approach, explicitly stating its assumptions, and acknowledging the absence of independent validation as a limitation that future work should address.
    revision: partial

  2. Referee: [Results] Results section (profile comparison): The abstract and main text state that the two profiles 'turn out to be substantially different,' yet no quantitative metrics (correlation, cosine similarity, or statistical tests with sample sizes and controls) or tables reporting these values are described. Without such evidence, the magnitude and reliability of the difference cannot be assessed and the claim that questionnaires mischaracterize LLM psychology remains under-supported.

    Authors: We agree that quantitative metrics are required to support the claim of substantial differences. The original manuscript presented the profile comparisons primarily through visualizations and qualitative description. In the revised Results section we now include a table reporting Pearson correlations, cosine similarities, and results of paired statistical tests (with sample sizes, degrees of freedom, and multiple-comparison corrections) between the questionnaire-derived and generation-based profiles for each of the eight models and four questionnaires. These metrics confirm low correlations and statistically significant differences, providing the requested quantitative grounding for the conclusion.
    revision: yes
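
The rebuttal mentions multiple-comparison corrections without naming a procedure. As one possible choice, and purely as an assumption made here, the Holm-Bonferroni step-down method can be sketched in Python. The function name and the p-values are invented placeholders, not values from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Holm's step-down procedure: sort p-values ascending, compare the
    k-th smallest (k = 0, 1, ...) against alpha / (m - k), and stop at
    the first comparison that fails.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# Hypothetical p-values, e.g. one paired test per questionnaire.
example_p_values = [0.01, 0.04, 0.03]
decisions = holm_bonferroni(example_p_values)
print(decisions)
```

Holm's method controls the family-wise error rate like the plain Bonferroni correction but is never less powerful, which is why it is a common default when many model-by-questionnaire tests are reported together.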

Circularity Check

0 steps flagged

Empirical comparison of questionnaire and generation profiles shows no reduction to fitted inputs or self-referential definitions

The paper performs a direct empirical comparison between Likert-scale responses from established human psychometric questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores derived from value- or personality-laden responses to real-world user queries across eight open-source LLMs. The central observation—that the resulting profiles differ substantially—is presented as an empirical finding rather than a mathematical derivation. No equations, fitted parameters, or predictions are involved that reduce outputs to inputs by construction. Citations to prior work on LLM psychological dispositions are used to contextualize the challenge but do not serve as load-bearing uniqueness theorems or self-citation chains that justify the core claim. The analysis remains self-contained through data collection and profile comparison without circular redefinition or smuggling of ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generation probabilities on real queries constitute a valid ground-truth measure of LLM psychology against which questionnaire responses can be judged.

axioms (1)
  • domain assumption: Generation probability scores of value- or personality-laden responses accurately reflect LLMs' psychological characteristics in real interactions.
    This premise is required to interpret questionnaire responses as mischaracterizing rather than simply differing from generation behavior.

pith-pipeline@v0.9.0 · 5714 in / 1147 out tokens · 20300 ms · 2026-05-18T17:39:22.412131+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

    cs.CL · 2026-03 · unverdicted · novelty 6.0

    DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstre...