arxiv: 2604.28048 · v2 · pith:6RTLUQDUnew · submitted 2026-04-30 · 💻 cs.CL · cs.SI

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Pith reviewed 2026-05-25 06:08 UTC · model grok-4.3

classification 💻 cs.CL cs.SI
keywords persona prompting · LLM agents · urban sentiment perception · PerceptSent dataset · annotation agreement · multimodal LLMs · behavioral consistency · extremity bias

The pith

Persona prompting produces stable but minimally differentiated urban sentiment judgments in LLMs, and a no-persona baseline sometimes matches or exceeds persona-conditioned agreement with human labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether assigning distinct personas to multimodal LLM agents generates varied and human-like judgments of urban scene sentiment. It applies a factorial combination of gender, economic status, political orientation, and personality traits to images from the PerceptSent dataset and measures both consistency within each persona and differences across personas. Agents sharing a persona converge strongly on their outputs, but differences between personas are small: gender shows no measurable effect and political orientation only a negligible one. Economic status and personality produce statistically detectable but modest shifts, and the models exhibit an extremity bias that collapses intermediate sentiment categories. A no-persona baseline sometimes matches or exceeds the persona-conditioned versions in agreement with human labels across task variants.

Core claim

Distinct personas induce stable and reproducible behavior within groups of multimodal LLM agents judging urban scenes from the PerceptSent dataset, yet cross-persona differentiation remains limited, with economic status and personality yielding statistically detectable but practically modest variation while gender shows no measurable effect and political orientation only negligible impact; agents further display an extremity bias that reduces performance on finer sentiment resolutions, and a no-persona baseline sometimes matches or exceeds persona-conditioned agreement with human labels.

What carries the argument

Factorial persona set (gender, economic status, political orientation, personality) applied to PerceptSent urban scene images, with metrics for within-persona consistency and cross-persona variation plus comparison to a no-persona baseline.
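
A full factorial design of this kind can be sketched in a few lines. The snippet below is only an illustration: the paper's Table I is not reproduced on this page, so the level names, the number of levels per dimension, and the prompt wording are all hypothetical placeholders, not the authors' actual persona set.

```python
from itertools import product

# Hypothetical levels: the paper's Table I is not available here, so these
# values are placeholders chosen only to illustrate a balanced design.
DIMENSIONS = {
    "gender": ["woman", "man"],
    "economic_status": ["low income", "high income"],
    "political_orientation": ["left-leaning", "right-leaning"],
    "personality": ["optimistic", "pessimistic"],
}


def build_personas(dimensions):
    """Return one dict per cell of the full factorial design."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


def persona_prompt(persona):
    """Render a simple label-based persona description (hypothetical wording)."""
    parts = ", ".join(f"{key.replace('_', ' ')}: {value}" for key, value in persona.items())
    return f"You are an annotator with the following profile. {parts}."


personas = build_personas(DIMENSIONS)
```

With two levels per dimension, the design has 2 × 2 × 2 × 2 = 16 cells, and every level appears in exactly half of them, which is what makes the design balanced.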

If this is right

  • Agents sharing the same persona exhibit strong convergence and reproducibility in their sentiment outputs.
  • Economic status and personality factors induce statistically detectable but modest cross-persona differences.
  • Gender produces no measurable differentiation in the generated judgments.
  • Political orientation produces only negligible cross-persona impact.
  • Models perform well on coarse polarity tasks but degrade on higher-resolution sentiment categories due to extremity bias.
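
One way to make "within-persona convergence" and "collapsed intermediate categories" concrete is sketched below. The page does not state which statistics the paper uses, so mean pairwise agreement and pooled label shares are stand-ins chosen for illustration; the label names and toy outputs are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(runs):
    """Mean share of images on which two agents of one persona give the same label."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two agents per persona")
    scores = [sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(scores) / len(scores)


def label_shares(runs):
    """Relative frequency of each label, pooled over all agents of a persona."""
    counts = Counter(label for run in runs for label in run)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def intermediate_share(shares, intermediate=("slightly negative", "slightly positive")):
    """Share of predictions that fall in intermediate categories (hypothetical names)."""
    return sum(shares.get(label, 0.0) for label in intermediate)


# Toy outputs: two agents per persona, four images each (made-up data).
toy = {
    "persona_a": [
        ["positive", "negative", "positive", "negative"],
        ["positive", "negative", "positive", "positive"],
    ],
    "persona_b": [
        ["positive", "negative", "negative", "negative"],
        ["positive", "negative", "negative", "negative"],
    ],
}

consistency = {name: pairwise_agreement(runs) for name, runs in toy.items()}
shares = {name: label_shares(runs) for name, runs in toy.items()}
```

In this toy data, both personas never use an intermediate label, so `intermediate_share` is zero for each, which is the pattern an extremity bias would produce. Cross-persona variation would then be read off by comparing the entries of `shares`.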

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • More elaborate or narrative persona descriptions beyond simple demographic labels may be required to elicit greater behavioral diversity.
  • The limited value of this prompting style could extend to other LLM proxy tasks involving subjective human judgments such as policy preferences or product reviews.
  • Direct comparisons between these LLM outputs and judgments from human raters who match the same demographic labels would clarify whether the observed stability is model-specific.
  • Alternative approaches such as few-shot examples drawn from diverse human annotators could be tested to increase variation without relying on label-based personas.

Load-bearing premise

The selected label-based personas and PerceptSent human annotations are sufficient to detect and represent meaningful differences in human perceptual diversity for urban sentiment.

What would settle it

A replication using the same model and images but with substantially more detailed or open-ended persona descriptions that produces large, human-matching cross-persona differences while the no-persona version underperforms on agreement.

Figures

Figures reproduced from arXiv: 2604.28048 by Daniel Silver, Neemias B da Silva, Rodrigo Minetto, Thiago H Silva.

Figure 1: High-level methodology overview. A. Persona Design: We construct a balanced full factorial design across four persona dimensions (Table I). We selected three commonly studied sociodemographic dimensions (gender, economic status, and political orientation) and included personality as an additional dimension, following the suggestion that persona effects may depend not only on demographic cues but also on ps…
Figure 2: Characteristics of the studied dataset. C. Annotation Pipeline: The annotation pipeline is a three-node LangGraph with conditional retry logic (…
Figure 3: Three-node annotation pipeline. The Eval & Retry cluster handles three failure modes (timeout, JSON parse error, bad…
Figure 4: PerceptSent dataset example. Since MLLM predictions are fixed (no model training), all images that satisfy the σ criterion are included in every fold. To estimate variability, we resample 60% of the annotation pool per image in each fold (716 annotations per image, from ≈1,194 total), recompute the modal sentiment, and evaluate performance. This yields between 2,868 and 35,850 annotations per fold, dependi…
Figure 5: Predicted sentiment distribution across all three condi…
Figure 6: Row-normalized sentiment proportion heatmap across…
Figure 7: Sentiment distribution by persona dimension (…
Figure 8: Pooled out-of-fold confusion matrices across agreement…
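
The resampling procedure described in the Figure 4 excerpt (draw 60% of each image's annotation pool per fold, recompute the modal sentiment, score the fixed model predictions) can be sketched as follows. This is a minimal illustration under stated assumptions: sampling without replacement, accuracy as the score, the tie-breaking rule, and all function and parameter names are choices made here, not details confirmed by the source.

```python
import random
from collections import Counter


def modal_label(labels):
    """Most frequent label; ties fall to whichever Counter returns first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


def resampled_accuracy(annotations, predictions, fraction=0.6, folds=5, seed=0):
    """Score fixed predictions against modal labels recomputed on resampled pools.

    annotations: dict mapping image id -> list of human labels
    predictions: dict mapping image id -> single model label (never retrained)
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(folds):
        correct = 0
        for image_id, pool in annotations.items():
            k = max(1, round(fraction * len(pool)))
            sample = rng.sample(pool, k)  # without replacement (assumed)
            correct += modal_label(sample) == predictions[image_id]
        scores.append(correct / len(annotations))
    return scores


# Made-up example with unanimous annotators, so every fold scores 1.0.
toy_annotations = {"img1": ["positive"] * 10, "img2": ["negative"] * 10}
toy_predictions = {"img1": "positive", "img2": "negative"}
fold_scores = resampled_accuracy(toy_annotations, toy_predictions)
```

As a check on the quoted figures, 60% of a pool of 1,194 annotations rounds to 716, which matches the numbers in the excerpt. The spread of `fold_scores` across folds is what gives the variability estimate, since the predictions themselves never change.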
Original abstract

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that simple label-based persona prompting in multimodal LLMs for urban sentiment perception on PerceptSent images produces stable within-persona behavior but only limited cross-persona variation (modest effects from economic status and personality; none from gender; negligible from political orientation). It further reports an extremity bias that reduces performance on fine-grained sentiment tasks and finds that a no-persona baseline sometimes matches or exceeds persona-conditioned agreement with human labels, concluding that such prompting adds limited annotation value.

Significance. If the central empirical comparison holds after addressing persona validation, the work would usefully caution the growing use of LLMs as perceptual proxies in urban analytics and annotation pipelines. The inclusion of a no-persona baseline is a strength that could become a standard control in future persona studies.

major comments (2)
  1. The central claim—that limited cross-persona differentiation demonstrates restricted value of persona prompting—depends on the chosen factorial dimensions (gender, economic status, political orientation, personality) being relevant to urban sentiment judgments. The manuscript provides no validation, prior literature, or analysis showing that these axes produce measurable differences in human PerceptSent annotations or that the label prompts activate distinct perceptual priors; without this, the stability and no-persona parity results cannot be attributed to the method rather than to the specific persona design.
  2. [Abstract] The statements that economic status and personality 'induce statistically detectable but practically modest variation' and that the no-persona model 'sometimes matches or exceeds' agreement lack accompanying effect sizes, exact agreement metrics (e.g., Cohen's kappa or accuracy per task variant), sample sizes, or statistical thresholds. These details are load-bearing for assessing whether the observed differences support the 'limited value' conclusion.
minor comments (1)
  1. [Abstract] The abstract mentions 'multimodal LLMs' and 'multiple agents per persona' but does not name the specific model(s), image encoder, or number of replicates; these should be stated explicitly even in the abstract for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our experimental design and presentation. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: The central claim (that limited cross-persona differentiation demonstrates restricted value of persona prompting) depends on the chosen factorial dimensions (gender, economic status, political orientation, personality) being relevant to urban sentiment judgments. The manuscript provides no validation, prior literature, or analysis showing that these axes produce measurable differences in human PerceptSent annotations or that the label prompts activate distinct perceptual priors; without this, the stability and no-persona parity results cannot be attributed to the method rather than to the specific persona design.

    Authors: These dimensions were chosen because they are among the most frequently instantiated persona attributes in LLM agent literature and align with established factors in urban sociology and environmental psychology that shape place-based judgments. We acknowledge that the manuscript does not include a direct empirical validation (e.g., subgroup analysis of human annotators) or explicit tests confirming that the label prompts activate distinct priors. The PerceptSent dataset does not provide annotator demographics, precluding such an analysis. In revision we will (1) cite prior work linking socioeconomic status, personality, and political orientation to environmental perception, (2) clarify that the study evaluates commonly used persona types rather than claiming these axes exhaust all relevant variation, and (3) add an explicit limitations paragraph noting that stronger differentiation might appear with other persona framings or datasets containing demographic metadata. This addresses the attribution concern without overstating the current evidence.

    Revision: partial

  2. Referee: [Abstract] The statements that economic status and personality 'induce statistically detectable but practically modest variation' and that the no-persona model 'sometimes matches or exceeds' agreement lack accompanying effect sizes, exact agreement metrics (e.g., Cohen's kappa or accuracy per task variant), sample sizes, or statistical thresholds. These details are load-bearing for assessing whether the observed differences support the 'limited value' conclusion.

    Authors: We agree that the abstract would benefit from greater quantitative precision. In the revised version we will insert concise effect-size information (e.g., standardized mean differences or partial eta-squared for the detectable factors), note the primary agreement metric (Cohen's kappa) and task variants, reference the number of images and agents evaluated, and indicate the significance threshold applied. These additions will be kept brief to respect abstract length limits while making the supporting evidence explicit.

    Revision: yes
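
For readers unfamiliar with the agreement metric named in this exchange, Cohen's kappa corrects observed agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below is a generic textbook implementation, not the paper's evaluation code; the function name, the toy labels, and the convention for the degenerate case are assumptions made here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 is merely the convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example with made-up labels: model output versus human modal labels.
model = ["positive", "positive", "negative", "negative"]
human = ["positive", "negative", "negative", "negative"]
kappa = cohens_kappa(model, human)
```

In the toy example the two sequences agree on 3 of 4 items (p_o = 0.75), chance agreement is (2·1 + 2·3) / 16 = 0.5, and kappa is therefore 0.5. Raw accuracy of 0.75 would look stronger than the chance-corrected value, which is why the referee asks for the exact metric per task variant.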

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation against external human labels

The paper conducts direct experimental comparisons of LLM outputs under factorial persona prompts versus human PerceptSent annotations and a no-persona baseline. No equations, derivations, fitted parameters, or self-citations are used to generate results; all reported metrics (within-persona consistency, cross-persona variation, agreement scores) are computed from fresh model inferences. The central claim follows immediately from these measurements without reduction to inputs by construction, satisfying the self-contained empirical criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard domain assumptions about dataset validity and the meaningfulness of label-based personas rather than introducing fitted parameters or new entities.

axioms (2)
  • Domain assumption: The PerceptSent dataset supplies reliable human annotations that serve as ground truth for sentiment judgments.
    Agreement with these labels is used to evaluate both persona and no-persona models.
  • Domain assumption: Label-based persona descriptions can be instantiated in prompts to produce behavior that meaningfully reflects human demographic and personality differences.
    This premise underpins the entire factorial design and the interpretation of limited cross-persona variation.

