Multimodal LLMs see sentiment
2 Pith papers cite this work.
abstract
Understanding how visual content conveys sentiment is increasingly important in a digital landscape dominated by imagery. However, sentiment perception depends on complex scene-level semantics, making this a challenging task for computational models. This paper examines how Multimodal Large Language Models (MLLMs) perform sentiment analysis in images through a systematic, evaluation-driven study encompassing three perspectives: (i) direct sentiment classification from images using MLLMs; (ii) sentiment analysis on MLLM-generated descriptions using pre-trained LLMs; and (iii) fine-tuning these LLMs on sentiment-labeled descriptions to assess performance and generalization. Experiments on a recent benchmark show that a two-stage MLLM description-mediated pipeline can substantially improve prediction accuracy under several evaluation settings, particularly when the LLM component is fine-tuned. Across different agreement thresholds and sentiment granularities, the strongest configurations of this pipeline outperform lexicon-, CNN-, and Transformer-based baselines in our benchmark by up to 30.9%, 64.8%, and 42.4%, respectively. In cross-dataset evaluation, the proposed pipeline, without training or fine-tuning on the target dataset, still surpasses the best in-domain baseline by over 8%. Overall, the study provides a comprehensive assessment of MLLM description-mediated sentiment analysis, clarifying the conditions under which it is effective, the scenarios in which it fails, and its comparison with traditional vision-based approaches, while also providing a reproducible benchmark resource for future research.
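The two-stage, description-mediated pipeline named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract names no specific models, prompts, or label sets, so `describe`, `classify`, `toy_describe`, `toy_classify`, and the three labels are hypothetical placeholders. In practice stage 1 would call an MLLM and stage 2 a pre-trained or fine-tuned LLM.

```python
from typing import Callable, Tuple

# Hypothetical stage signatures (assumptions, not taken from the paper).
Describer = Callable[[bytes], str]   # stage 1: image bytes -> text description
Classifier = Callable[[str], str]    # stage 2: description -> sentiment label


def description_mediated_sentiment(
    image: bytes, describe: Describer, classify: Classifier
) -> Tuple[str, str]:
    """Run the two stages in sequence and return (description, label)."""
    description = describe(image)   # stage 1: describe the image in text
    label = classify(description)   # stage 2: classify the text, not the pixels
    return description, label


def toy_describe(image: bytes) -> str:
    # Placeholder for an MLLM call; ignores the input and returns a fixed caption.
    return "A child smiling while holding a birthday cake."


def toy_classify(description: str) -> str:
    # Placeholder for an LLM classifier; a keyword rule used only for illustration.
    text = description.lower()
    if any(word in text for word in ("smiling", "happy", "celebrat")):
        return "positive"
    if any(word in text for word in ("crying", "ruins", "angry")):
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(description_mediated_sentiment(b"", toy_describe, toy_classify))
```

Passing the stages in as callables keeps the sketch independent of any particular model API, so a pre-trained or a fine-tuned classifier could be swapped into stage 2 without touching stage 1.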
fields
cs.CL 2
years
2026 2
verdicts
UNVERDICTED 2
representative citing papers
Persona prompting in multimodal LLMs for urban sentiment yields high within-persona stability but limited cross-persona variation, with no-persona models often matching or exceeding persona-conditioned agreement to human labels.
citing papers explorer
- Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
  Large-scale analysis of 59,808 annotations shows persona prompting produces convergent captions but systematically varying justifications tied to socioeconomic and political attributes in multimodal LLM urban perception outputs.
- Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
  Persona prompting in multimodal LLMs for urban sentiment yields high within-persona stability but limited cross-persona variation, with no-persona models often matching or exceeding persona-conditioned agreement to human labels.