Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
Pith reviewed 2026-05-25 06:08 UTC · model grok-4.3
The pith
Persona prompting produces stable but only weakly differentiated urban sentiment judgments in multimodal LLMs, and a no-persona baseline sometimes matches or exceeds persona-conditioned agreement with human labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Distinct personas induce stable, reproducible behavior within groups of multimodal LLM agents judging urban scenes from the PerceptSent dataset, yet cross-persona differentiation remains limited. Economic status and personality yield statistically detectable but practically modest variation, gender shows no measurable effect, and political orientation has only negligible impact. The agents also display an extremity bias that reduces performance at finer sentiment resolutions, and a no-persona baseline sometimes matches or exceeds persona-conditioned agreement with human labels.
What carries the argument
Factorial persona set (gender, economic status, political orientation, personality) applied to PerceptSent urban scene images, with metrics for within-persona consistency and cross-persona variation plus comparison to a no-persona baseline.
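A minimal sketch of how such a factorial persona set could be enumerated. The four factor names come from the abstract; the levels within each factor, the prompt wording, and the helper names `build_personas` and `persona_prompt` are illustrative assumptions, not the paper's actual configuration.

```python
from itertools import product

# Hypothetical factor levels: the summary names the four factors but not
# their levels, so every value below is an illustrative assumption.
FACTORS = {
    "gender": ["woman", "man"],
    "economic status": ["low-income", "high-income"],
    "political orientation": ["left-leaning", "right-leaning"],
    "personality": ["optimistic", "pessimistic"],
}


def build_personas(factors):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def persona_prompt(persona):
    """Render a label-based persona as a short instruction (assumed wording)."""
    traits = "; ".join(f"{name}: {value}" for name, value in persona.items())
    return f"Adopt this persona ({traits}) and rate the sentiment of the urban scene."


personas = build_personas(FACTORS)

# The no-persona baseline uses the same task text without any persona clause.
NO_PERSONA_PROMPT = "Rate the sentiment of the urban scene."

print(len(personas))  # 2 * 2 * 2 * 2 = 16 cells under these assumed levels
print(persona_prompt(personas[0]))
```

Enumerating every cell keeps each factor balanced across the others, which is what lets a per-factor effect (for example, economic status) be separated from the rest.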
If this is right
- Agents sharing the same persona exhibit strong convergence and reproducibility in their sentiment outputs.
- Economic status and personality factors induce statistically detectable but modest cross-persona differences.
- Gender produces no measurable differentiation in the generated judgments.
- Political orientation produces only negligible cross-persona impact.
- Models perform well on coarse polarity tasks but degrade on higher-resolution sentiment categories due to extremity bias.
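The last point can be made concrete with a toy calculation. The sketch below assumes a five-point rating scale from -2 to 2 and a deliberately exaggerated agent that pushes every mild rating to the extreme of the same sign; the scale, the data, and the function names are all hypothetical and are not taken from the paper.

```python
def push_to_extremes(label):
    """Hypothetical extremity bias: mild ratings become the extreme of the same sign."""
    if label > 0:
        return 2
    if label < 0:
        return -2
    return 0


def polarity(label):
    """Collapse a five-point rating to negative (-1), neutral (0), or positive (1)."""
    return (label > 0) - (label < 0)


def accuracy(predicted, reference):
    """Share of items on which the two label sequences agree."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy human ratings on the assumed scale (-2 = very negative, 2 = very positive).
human = [-2, -1, -1, 0, 1, 1, 2, 0, -1, 1]
agent = [push_to_extremes(h) for h in human]

fine = accuracy(agent, human)
coarse = accuracy([polarity(a) for a in agent], [polarity(h) for h in human])
print(fine, coarse)  # 0.4 1.0
```

In this toy case the agent never gets the sign wrong, so polarity agreement is perfect, yet it misses every intermediate rating, which is the pattern the summary describes as performance degrading with resolution.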
Where Pith is reading between the lines
- More elaborate or narrative persona descriptions beyond simple demographic labels may be required to elicit greater behavioral diversity.
- The limited value of this prompting style could extend to other LLM proxy tasks involving subjective human judgments such as policy preferences or product reviews.
- Direct comparisons between these LLM outputs and judgments from human raters who match the same demographic labels would clarify whether the observed stability is model-specific.
- Alternative approaches such as few-shot examples drawn from diverse human annotators could be tested to increase variation without relying on label-based personas.
Load-bearing premise
The selected label-based personas and PerceptSent human annotations are sufficient to detect and represent meaningful differences in human perceptual diversity for urban sentiment.
What would settle it
A replication using the same model and images but with substantially more detailed or open-ended persona descriptions that produces large, human-matching cross-persona differences while the no-persona version underperforms on agreement.
Original abstract
Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that simple label-based persona prompting in multimodal LLMs for urban sentiment perception on PerceptSent images produces stable within-persona behavior but only limited cross-persona variation (modest effects from economic status and personality; none from gender; negligible from political orientation). It further reports an extremity bias that reduces performance on fine-grained sentiment tasks and finds that a no-persona baseline sometimes matches or exceeds persona-conditioned agreement with human labels, concluding that such prompting may add limited annotation value.
Significance. If the central empirical comparison holds after addressing persona validation, the work would usefully caution the growing use of LLMs as perceptual proxies in urban analytics and annotation pipelines. The inclusion of a no-persona baseline is a strength that could become a standard control in future persona studies.
major comments (2)
- The central claim, that limited cross-persona differentiation demonstrates restricted value of persona prompting, depends on the chosen factorial dimensions (gender, economic status, political orientation, personality) being relevant to urban sentiment judgments. The manuscript provides no validation, prior literature, or analysis showing that these axes produce measurable differences in human PerceptSent annotations or that the label prompts activate distinct perceptual priors; without this, the stability and no-persona parity results cannot be attributed to the method rather than to the specific persona design.
- [Abstract] The statements that economic status and personality 'induce statistically detectable but practically modest variation' and that the no-persona model 'sometimes matches or exceeds' agreement lack accompanying effect sizes, exact agreement metrics (e.g., Cohen's kappa or accuracy per task variant), sample sizes, or statistical thresholds. These details are load-bearing for assessing whether the observed differences support the 'limited value' conclusion.
minor comments (1)
- [Abstract] The abstract mentions 'multimodal LLMs' and 'multiple agents per persona' but does not name the specific model(s), image encoder, or number of replicates; these should be stated explicitly even in the abstract for reproducibility.
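For readers unfamiliar with the agreement metric the referee asks for, here is a minimal sketch of Cohen's kappa between model labels and human labels. It uses the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e); the label names and toy data are hypothetical, and nothing here reproduces the paper's reported numbers.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): the model never uses the intermediate "neu" label.
human = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu"]
model = ["pos", "pos", "neg", "neg", "pos", "pos", "neg", "neg"]

kappa = cohen_kappa(human, model)
print(round(kappa, 3))  # 0.6
```

Reporting kappa per task variant, alongside raw accuracy, would show how much of the agreement survives once chance agreement on the dominant classes is discounted.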
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of our experimental design and presentation. We address each major comment below and indicate planned revisions.
- Referee: The central claim, that limited cross-persona differentiation demonstrates restricted value of persona prompting, depends on the chosen factorial dimensions (gender, economic status, political orientation, personality) being relevant to urban sentiment judgments. The manuscript provides no validation, prior literature, or analysis showing that these axes produce measurable differences in human PerceptSent annotations or that the label prompts activate distinct perceptual priors; without this, the stability and no-persona parity results cannot be attributed to the method rather than to the specific persona design.
  Authors: These dimensions were chosen because they are among the most frequently instantiated persona attributes in LLM agent literature and align with established factors in urban sociology and environmental psychology that shape place-based judgments. We acknowledge that the manuscript does not include a direct empirical validation (e.g., subgroup analysis of human annotators) or explicit tests confirming that the label prompts activate distinct priors. The PerceptSent dataset does not provide annotator demographics, precluding such an analysis. In revision we will (1) cite prior work linking socioeconomic status, personality, and political orientation to environmental perception, (2) clarify that the study evaluates commonly used persona types rather than claiming these axes exhaust all relevant variation, and (3) add an explicit limitations paragraph noting that stronger differentiation might appear with other persona framings or datasets containing demographic metadata. This addresses the attribution concern without overstating the current evidence.
  Revision: partial
- Referee: [Abstract] The statements that economic status and personality 'induce statistically detectable but practically modest variation' and that the no-persona model 'sometimes matches or exceeds' agreement lack accompanying effect sizes, exact agreement metrics (e.g., Cohen's kappa or accuracy per task variant), sample sizes, or statistical thresholds. These details are load-bearing for assessing whether the observed differences support the 'limited value' conclusion.
  Authors: We agree that the abstract would benefit from greater quantitative precision. In the revised version we will insert concise effect-size information (e.g., standardized mean differences or partial eta-squared for the detectable factors), note the primary agreement metric (Cohen’s kappa) and task variants, reference the number of images and agents evaluated, and indicate the significance threshold applied. These additions will be kept brief to respect abstract length limits while making the supporting evidence explicit.
  Revision: yes
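As an illustration of the kind of effect size the authors promise, the sketch below computes eta squared for a single factor, i.e. the between-group sum of squares divided by the total sum of squares. With only one factor in the model, partial eta squared coincides with this quantity; in the full factorial design the two would generally differ. The scores and group names are hypothetical.

```python
from statistics import fmean


def eta_squared(groups):
    """One-way eta squared: between-group sum of squares over total sum of squares."""
    scores = [x for group in groups for x in group]
    grand_mean = fmean(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # With zero total variance there is nothing to explain, so report 0.0.
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical sentiment scores from agents in two economic-status conditions.
low_status = [1, 2, 3]
high_status = [2, 3, 4]

effect = eta_squared([low_status, high_status])
print(round(effect, 3))  # 0.273
```

A factor can reach statistical significance with many agents and images while explaining only a small share of variance, which is why the referee treats the effect size as load-bearing.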
Circularity Check
No circularity; purely empirical evaluation against external human labels
The paper conducts direct experimental comparisons of LLM outputs under factorial persona prompts versus human PerceptSent annotations and a no-persona baseline. No equations, derivations, fitted parameters, or self-citations are used to generate results; all reported metrics (within-persona consistency, cross-persona variation, agreement scores) are computed from fresh model inferences. The central claim follows immediately from these measurements without reduction to inputs by construction, satisfying the self-contained empirical criterion.
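One simple way to quantify within-persona consistency is the mean pairwise agreement among agents that share a persona. This is a sketch of one plausible measure under that assumption, with hypothetical labels and function name; the summary does not specify which consistency metric the paper actually uses.

```python
from itertools import combinations


def mean_pairwise_agreement(runs):
    """Average share of images on which two agents give the same label."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two agents")
    per_pair = [sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(per_pair) / len(per_pair)


# Hypothetical labels from three agents sharing one persona, over four images.
same_persona_runs = [
    ["pos", "neg", "pos", "neu"],
    ["pos", "neg", "pos", "neu"],
    ["pos", "neg", "pos", "pos"],
]

consistency = mean_pairwise_agreement(same_persona_runs)
print(round(consistency, 3))  # 0.833
```

Applying the same function to agents drawn from different personas gives a cross-persona figure on the same scale, so the two can be compared directly.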
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The PerceptSent dataset supplies reliable human annotations that serve as ground truth for sentiment judgments.
- Domain assumption: Label-based persona descriptions can be instantiated in prompts to produce behavior that meaningfully reflects human demographic and personality differences.