pith. machine review for the scientific record

arxiv: 2604.28048 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.SI

Recognition: unknown

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 06:47 UTC · model grok-4.3

classification 💻 cs.CL cs.SI
keywords LLM agents · persona prompting · urban sentiment analysis · perceptual variation · multimodal models · annotation proxies · sentiment resolution · PerceptSent dataset

The pith

Simple persona prompting in multimodal LLMs produces stable urban sentiment judgments but limited variation across personas, often matched by unprompted models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether distinct personas assigned to LLM agents generate meaningful diversity in judgments of urban scene sentiment. Multiple agents per persona, spanning gender, economic status, political orientation, and personality, rate images from the PerceptSent dataset. Agents sharing a persona show high agreement, indicating stable and reproducible outputs. Cross-persona differences are small overall: economic status and personality cause modest shifts, gender produces no measurable shift, and political orientation only a negligible one. Models run without any persona conditioning match or exceed persona-conditioned alignment with human labels on most tasks, implying limited added value from basic label-based personas.

Core claim

Multimodal LLMs exhibit strong within-persona consistency in urban sentiment judgments under demographic and personality conditioning, yet cross-persona differentiation stays modest: only economic status and personality yield statistically detectable but practically small effects, while gender shows no measurable impact and political orientation only a negligible one. Agents display an extremity bias, favoring extreme categories and underusing the intermediate ones common in human annotations. This yields strong performance on coarse polarity tasks but degraded results as sentiment resolution increases. Removing persona conditioning does not reduce, and sometimes improves, agreement with human labels.

What carries the argument

Factorial persona set (gender, economic status, political orientation, personality) instantiated in multiple multimodal LLM agents per persona, measuring within-persona stability and cross-persona variation on PerceptSent urban images against human labels and a no-persona baseline.
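A full factorial persona grid of this kind is straightforward to enumerate; a minimal sketch in which the factor levels are hypothetical placeholders rather than the paper's exact wording from Table I:

```python
from itertools import product

# Hypothetical factor levels -- illustrative only, not the paper's wording.
FACTORS = {
    "gender": ["woman", "man"],
    "economic_status": ["low income", "high income"],
    "political_orientation": ["progressive", "conservative"],
    "personality": ["introverted", "extroverted"],
}
AGENTS_PER_PERSONA = 3  # several independent agents instantiated per cell

def persona_grid(factors):
    """Yield every cell of the full factorial crossing of factor levels."""
    names = list(factors)
    for combo in product(*(factors[n] for n in names)):
        yield dict(zip(names, combo))

personas = list(persona_grid(FACTORS))
# 2^4 = 16 persona cells here; 16 * 3 = 48 agent runs in this sketch.
```

Within-persona stability is then estimated across the agents inside one cell, and cross-persona variation across cells.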

If this is right

  • Coarse-grained polarity detection remains reliable even without persona conditioning.
  • Fine-grained sentiment scales suffer from LLM extremity bias regardless of persona use.
  • Persona labels may be unnecessary for achieving human-level agreement on urban perception tasks.
  • Multiple agents per fixed persona can still provide reproducible proxy measurements.
  • Simple label-based conditioning does not reliably simulate demographic perceptual diversity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Validation of LLM proxies should include direct comparisons to matched human demographic samples rather than relying on label-induced variation alone.
  • The limited cross-persona effect may extend to other subjective judgment domains where LLMs are tasked with modeling population differences.
  • Alternative conditioning methods such as real demographic examples or targeted fine-tuning could be tested to increase perceptual variation.
  • The observed extremity bias suggests LLMs may systematically under-represent moderate human views in urban analysis applications.
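The last extension can be made concrete with a simple distributional check; a minimal sketch with hypothetical category names and illustrative (not actual) label sequences:

```python
from collections import Counter

def extremity_share(labels, extremes=("very negative", "very positive")):
    """Fraction of ratings falling in the scale's endpoint categories."""
    counts = Counter(labels)
    return sum(counts[e] for e in extremes) / len(labels)

# Illustrative label sequences, not data from the paper:
human = ["negative", "neutral", "neutral", "positive", "very positive"]
model = ["very negative", "very positive", "very positive",
         "very negative", "positive"]
```

An extremity bias shows up as `extremity_share(model)` far exceeding `extremity_share(human)`: 0.8 versus 0.2 on these illustrative sequences.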

Load-bearing premise

The chosen persona descriptions and the PerceptSent image set together produce a representative sample of the perceptual variation that actually exists across real human demographic and personality groups.

What would settle it

A direct head-to-head comparison where human participants from the exact demographic and personality groups described in the personas rate the same PerceptSent images, testing whether measured LLM cross-persona differences align with actual human group differences.

Figures

Figures reproduced from arXiv: 2604.28048 by Daniel Silver, Neemias B da Silva, Rodrigo Minetto, Thiago H Silva.

  • Figure 1: High-level methodology overview. The caption describes a balanced full factorial design across four persona dimensions: three commonly studied sociodemographic dimensions (gender, economic status, political orientation) plus personality.
  • Figure 2: Characteristics of the studied dataset.
  • Figure 3: Three-node annotation pipeline, a LangGraph with conditional retry logic; the Eval & Retry cluster handles failure modes such as timeouts and JSON parse errors.
  • Figure 4: PerceptSent dataset example. Since MLLM predictions are fixed (no model training), all images satisfying the σ criterion appear in every fold; each fold resamples 60% of the annotation pool per image (716 annotations per image, from ≈1,194 total), recomputes the modal sentiment, and re-evaluates performance, yielding between 2,868 and 35,850 annotations per fold.
  • Figure 5: Predicted sentiment distribution across all three conditions.
  • Figure 6: Row-normalized sentiment proportion heatmap.
  • Figure 7: Sentiment distribution by persona dimension.
  • Figure 8: Pooled out-of-fold confusion matrices.
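The fold construction noted in the Figure 4 caption (resample a fraction of each image's annotation pool, then recompute the modal sentiment) can be sketched as follows; this is a minimal reading of the caption, not the authors' code:

```python
import random
from collections import Counter

def modal_sentiment(labels):
    """Most frequent label; ties broken alphabetically for determinism."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)

def resample_fold(annotations_by_image, frac=0.6, seed=0):
    """One evaluation fold: draw `frac` of each image's annotation pool
    without replacement and recompute that image's modal sentiment."""
    rng = random.Random(seed)
    fold = {}
    for image_id, anns in annotations_by_image.items():
        k = max(1, round(len(anns) * frac))
        fold[image_id] = modal_sentiment(rng.sample(anns, k))
    return fold
```

Repeating this over many seeds gives the per-fold variability estimate the caption describes.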
original abstract

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper investigates whether simple label-based persona prompting induces meaningful and reproducible behavioral diversity in multimodal LLMs when used as proxies for human urban sentiment perception. Using a factorial design of personas (gender, economic status, political orientation, personality) instantiated with multiple agents each, the authors evaluate judgments on images from the PerceptSent dataset. They report strong within-persona consistency, limited cross-persona variation (statistically detectable but practically modest for economic status and personality; none for gender; negligible for political orientation), an extremity bias relative to human annotations, and competitive or superior performance of an unprompted no-persona baseline against human labels, especially on coarse polarity tasks. The conclusion is that such prompting adds limited annotation value in this setting.

Significance. If the empirical patterns hold, the work is significant for NLP and urban analytics applications that rely on LLMs to simulate diverse human judgments. It supplies a direct, controlled comparison against both human labels and a no-persona baseline, distinguishes statistical detectability from practical effect size, and documents an extremity bias that degrades performance at finer sentiment resolutions. These elements provide concrete evidence that basic persona conditioning may not reliably expand the range of outputs beyond default model behavior, encouraging more rigorous validation of prompting techniques before deployment as human proxies.

major comments (2)
  1. [Methods / Results] Methods and Results sections: The central interpretation—that simple persona prompting adds limited value—rests on the assumption that the chosen persona labels and PerceptSent images are capable of surfacing genuine cross-group perceptual differences if the prompting technique were effective. The manuscript does not report whether the human annotators underlying the PerceptSent labels exhibit measurable variation along the tested axes (e.g., differences by economic status or personality on the same scenes). Without such validation or an explicit discussion of this boundary condition, the observed stability and limited differentiation could reflect an insensitive test rather than a general limitation of label-based persona prompting.
  2. [Results] Results section (no-persona comparison): The claim that the no-persona model 'sometimes matches or exceeds' persona-conditioned agreement requires clearer quantification. Report the precise conditions, sample sizes, and any equivalence or superiority tests used; also provide effect sizes (e.g., Cohen’s d or raw agreement deltas) for the 'modest' cross-persona effects to allow readers to judge practical significance independently of p-values.
minor comments (3)
  1. [Methods] Methods: Explicitly state the exact number of agents per persona, total images evaluated, image-selection criteria, and any multiple-testing corrections applied to the statistical tests for cross-persona differences.
  2. [Results / Figures] Figures and text: Clarify how the extremity bias is operationalized (e.g., distribution shift metrics) and ensure all sentiment scales and agreement metrics are defined consistently between LLM outputs and human labels.
  3. [Discussion] Discussion: Add a short paragraph on potential overlaps or ambiguities in the persona prompt wording that might reduce differentiation (e.g., between economic status and personality descriptors).
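The metrics the referee requests are standard; a minimal sketch (not code from the paper) of raw agreement and Cohen's kappa between a model's labels and human labels:

```python
from collections import Counter

def percent_agreement(a, b):
    """Raw observed agreement between two equal-length label sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen's kappa) for two raters."""
    n = len(a)
    po = percent_agreement(a, b)                       # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

Raw agreement deltas between persona-conditioned and no-persona runs are then simple differences of `percent_agreement` values against the same human labels.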

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight opportunities to clarify the scope of our findings and improve quantitative reporting. We address each major comment point by point below, indicating revisions where appropriate.

point-by-point responses
  1. Referee: [Methods / Results] Methods and Results sections: The central interpretation—that simple persona prompting adds limited value—rests on the assumption that the chosen persona labels and PerceptSent images are capable of surfacing genuine cross-group perceptual differences if the prompting technique were effective. The manuscript does not report whether the human annotators underlying the PerceptSent labels exhibit measurable variation along the tested axes (e.g., differences by economic status or personality on the same scenes). Without such validation or an explicit discussion of this boundary condition, the observed stability and limited differentiation could reflect an insensitive test rather than a general limitation of label-based persona prompting.

    Authors: We agree that this boundary condition merits explicit discussion. The PerceptSent dataset supplies aggregated human labels without per-annotator metadata on economic status, personality, or related attributes, so we cannot empirically demonstrate cross-group variation in the human annotations along the tested axes. We will revise the manuscript to add a dedicated paragraph in the Methods and Discussion sections acknowledging this limitation and clarifying that our conclusions concern the ability of label-based persona prompting to improve alignment with the available human label distributions (rather than to reproduce all possible human perceptual differences). Our core empirical result—that persona conditioning did not improve agreement over the no-persona baseline—remains valid under this constraint, as it is evaluated directly against the provided human labels. revision: partial

  2. Referee: [Results] Results section (no-persona comparison): The claim that the no-persona model 'sometimes matches or exceeds' persona-conditioned agreement requires clearer quantification. Report the precise conditions, sample sizes, and any equivalence or superiority tests used; also provide effect sizes (e.g., Cohen’s d or raw agreement deltas) for the 'modest' cross-persona effects to allow readers to judge practical significance independently of p-values.

    Authors: We thank the referee for this recommendation to strengthen reporting. In the revised manuscript we will expand the Results section with precise agreement metrics (e.g., Cohen’s kappa or accuracy) for the no-persona condition versus each persona variant across the polarity, 3-class, and 5-class tasks. We will specify the evaluation conditions, the number of agents instantiated per persona, and the number of runs performed. Effect sizes, including Cohen’s d for cross-persona differences and raw agreement deltas, will be reported alongside p-values. Statistical tests used for comparisons (including any equivalence testing) will be detailed, with results presented in updated tables to permit independent assessment of practical significance. revision: yes

standing simulated objections not resolved
  • The PerceptSent dataset does not include per-annotator demographic or personality metadata, so we cannot report measurable variation in human labels along the tested axes.

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of prompted outputs to human labels and baseline

full rationale

The paper performs an empirical evaluation by instantiating LLM agents with label-based personas (gender, economic status, political orientation, personality), running them on PerceptSent urban images, measuring within-persona consistency and cross-persona variation, and comparing agreement with human annotations against an unprompted baseline. No quantities are defined in terms of fitted parameters that are then called predictions, no self-definitional loops exist in the metrics or claims, and no load-bearing self-citations or uniqueness theorems reduce the central result to prior author work. The observed stability, limited differentiation, and competitive no-persona performance are direct observational outcomes from the experimental runs, making the derivation self-contained against external human labels.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen persona prompts and the PerceptSent images jointly sample the relevant space of human perceptual variation. No new physical entities or mathematical constants are introduced. The only free parameters are the discrete persona factor levels and the number of agents instantiated per cell; these are chosen by the experimenters rather than fitted to the target labels.

free parameters (2)
  • persona factor levels
    Discrete choices of gender, economic status, political orientation, and personality traits used to construct the prompt templates.
  • agents per persona
    Number of independent LLM instantiations run for each persona combination; affects the within-persona consistency estimate.
axioms (2)
  • domain assumption LLM outputs under fixed prompts are sufficiently stable to treat multiple runs as independent samples of the same persona distribution
    Invoked when the authors interpret high within-persona agreement as evidence of stable behavior.
  • domain assumption The PerceptSent human annotations constitute a valid ground-truth distribution of urban sentiment perception across demographic groups
    Required for the claim that persona-conditioned outputs add limited value relative to human labels.
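The first assumption is directly checkable from the runs themselves; a minimal sketch, assuming each agent's output is stored as an equal-length label sequence:

```python
from itertools import combinations

def within_persona_agreement(runs):
    """Mean pairwise percent agreement across agents sharing one persona.
    `runs`: list of equal-length label sequences, one per agent."""
    def agree(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)
    pairs = list(combinations(runs, 2))
    return sum(agree(a, b) for a, b in pairs) / len(pairs)
```

Values near 1.0 across persona cells are what the paper reads as stable, reproducible behavior.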

pith-pipeline@v0.9.0 · 5531 in / 1608 out tokens · 78136 ms · 2026-05-07T06:47:59.068578+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Out of one, many: Using language models to simulate human samples,

    L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, and D. Wingate, “Out of one, many: Using language models to simulate human samples,” Political Analysis, vol. 31, no. 3, pp. 337–351, 2023

  2. [2]

    Using large language models to simulate multiple humans and replicate human subject studies,

    G. Aher, R. I. Arriaga, and A. T. Kalai, “Using large language models to simulate multiple humans and replicate human subject studies,” in Proc. of ICML, (Honolulu, Hawaii, USA), JMLR.org, 2023

  3. [3]

    LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

    J. S. Park, C. Q. Zou, A. Shaw, B. M. Hill, C. Cai, M. R. Morris, R. Willer, P. Liang, and M. S. Bernstein, “Generative agent simulations of 1,000 people,” arXiv preprint arXiv:2411.10109, 2024

  4. [4]

    Sensitivity, performance, robustness: Deconstructing the effect of sociodemographic prompting,

    T. Beck, H. Schuff, A. Lauscher, and I. Gurevych, “Sensitivity, performance, robustness: Deconstructing the effect of sociodemographic prompting,” in Proc. of EACL, (St. Julian’s, Malta), pp. 2589–2615, 2024

  5. [5]

    Quantifying the persona effect in LLM simulations,

    T. Hu and N. Collier, “Quantifying the persona effect in LLM simulations,” in Proc. of ACL (L.-W. Ku, A. Martins, and V. Srikumar, eds.), (Bangkok, Thailand), pp. 10289–10307, ACL, Aug. 2024

  6. [6]

    Simulating society requires simulating thought,

    C. J. Li, J. Wu, Z. Mo, A. Qu, Y. Tang, K. I. Zhao, Y. Gan, J. Fan, J. Yu, J. Zhao, et al., “Simulating society requires simulating thought,” arXiv preprint arXiv:2506.06958, 2025

  7. [7]

    Can generative AI improve social science?,

    C. A. Bail, “Can generative AI improve social science?,” PNAS, vol. 121, no. 21, p. e2314021121, 2024

  8. [8]

    Large language models that replace human participants can harmfully misportray and flatten identity groups,

    A. Wang, J. Morgenstern, and J. P. Dickerson, “Large language models that replace human participants can harmfully misportray and flatten identity groups,” Nat Mach Intell, vol. 7, no. 3, pp. 400–411, 2025

  9. [9]

    Not yet: Large language models cannot replace human respondents for psychometric research,

    P. Wang, H. Zou, Z. Yan, F. Guo, T. Sun, Z. Xiao, and B. Zhang, “Not yet: Large language models cannot replace human respondents for psychometric research,” OSF: osf.io/preprints/osf/rwy9b v1, 2024

  10. [10]

    The prompt makes the person(a): A systematic evaluation of sociodemographic persona prompting for large language models,

    M. Lutz, I. Sen, G. Ahnert, E. Rogers, and M. Strohmaier, “The prompt makes the person(a): A systematic evaluation of sociodemographic persona prompting for large language models,” in Proc. of EMNLP, (Suzhou, China), pp. 23212–23237, ACL, Nov. 2025

  11. [11]

    PerceptSent - exploring subjectivity in a novel dataset for visual sentiment analysis,

    C. R. Lopes, R. Minetto, M. R. Delgado, and T. H. Silva, “PerceptSent - exploring subjectivity in a novel dataset for visual sentiment analysis,” IEEE Transactions on Affective Computing, vol. 14, no. 3, 2023

  12. [12]

    Outdoorsent: Sentiment analysis of urban outdoor images by using semantic and deep features,

    W. B. d. Oliveira, L. B. Dorini, R. Minetto, and T. H. Silva, “Outdoorsent: Sentiment analysis of urban outdoor images by using semantic and deep features,” ACM Trans. Inf. Syst., vol. 38, Apr. 2020

  13. [13]

    Large language models as simulated economic agents: What can we learn from homo silicus?,

    A. Filippas, J. J. Horton, and B. S. Manning, “Large language models as simulated economic agents: What can we learn from homo silicus?,” in Proc. of EC, (New Haven, CT, USA), pp. 614–615, ACM, 2024

  14. [14]

    Humans and LLMs rate deliberation as superior to intuition on complex reasoning tasks,

    W. De Neys and M. Raoelison, “Humans and LLMs rate deliberation as superior to intuition on complex reasoning tasks,” Communications Psychology, vol. 3, no. 1, p. 141, 2025

  15. [15]

    Are LLMs empathetic to all? investigating the influence of multi-demographic personas on a model’s empathy,

    A. Malik, N. Sabri, M. M. Karnaze, and M. ElSherief, “Are LLMs empathetic to all? Investigating the influence of multi-demographic personas on a model’s empathy,” in Proc. of EMNLP, (Suzhou, China), pp. 24938–24959, ACL, Nov. 2025

  16. [16]

    Synthetic personas distort the structure of human belief systems,

    C. Barrie and R. Cerina, “Synthetic personas distort the structure of human belief systems,” OSF: osf.io/preprints/socarxiv/n7fq8 v1, 2026

  17. [17]

    Multimodal LLMs see sentiment,

    N. B. da Silva, J. Harrison, R. Minetto, M. R. Delgado, B. T. Nassu, and T. H. Silva, “Multimodal LLMs see sentiment,” ArXiv: https://arxiv.org/abs/2508.16873, 2025

  18. [18]

    Asking sensitive questions: The impact of data collection mode, question format, and question context,

    R. Tourangeau and T. W. Smith, “Asking sensitive questions: The impact of data collection mode, question format, and question context,” Public Opinion Quarterly, vol. 60, no. 2, pp. 275–304, 1996