pith. machine review for the scientific record.

arxiv: 2605.14635 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:43 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords: multi-label visual emotion · emotion distribution prediction · MLLM evaluation · annotation aggregation · benchmark dataset · visual emotion analysis · multimodal models · LLM-as-a-judge

The pith

A multi-label benchmark with aggregated annotator votes shows recent MLLMs have advanced on visual emotion prediction but still leave substantial room for improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing visual emotion datasets rely on a single-candidate annotation scheme that forces annotators to judge one emotion at a time, missing how one image can evoke multiple emotions at different strengths. The paper counters this by building a new dataset where twenty annotators per image independently select every emotion they feel, then aggregates those selections into vote distributions over eight emotion categories. This produces a 10,344-image benchmark with 236,998 votes that supports both dominant-emotion accuracy and full-distribution matching tasks. When recent MLLMs such as Qwen3-VL, GPT, Gemini, and Claude are tested on the dataset, they display clear gains over earlier models yet still fall short of human distributions. Experiments further show that routing the task through an LLM-as-a-judge does not reliably raise performance.

Core claim

The paper establishes that a multi-label visual emotion benchmark created by aggregating independent selections from twenty annotators per image into vote distributions yields a more representative ground truth than prior single-label schemes. Evaluations of current MLLMs on this benchmark demonstrate measurable progress in both dominant emotion prediction and emotion distribution prediction, while also revealing persistent gaps; additionally, the LLM-as-a-judge approach does not consistently improve results on this subjective task.

What carries the argument

Aggregation of twenty independent annotator selections into per-image vote distributions across eight emotion categories, used as the evaluation target for both dominant and distributional prediction.
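
To make the mechanism concrete, here is a minimal sketch (our illustration, not the authors' code; the emotion ordering, function names, and toy data are assumptions) of turning 20 binary annotator selections into a per-image vote distribution and a dominant emotion:

```python
import numpy as np

# Mikels' eight emotion categories; the ordering here is an illustrative assumption
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

def aggregate_votes(selections: np.ndarray) -> dict:
    """selections: (n_annotators, 8) binary matrix, 1 = annotator selected that emotion."""
    counts = selections.sum(axis=0)                     # raw votes per emotion
    rates = counts / selections.shape[0]                # fraction of annotators selecting each emotion
    dominant = [EMOTIONS[i] for i in np.flatnonzero(counts == counts.max())]
    return {"votes": dict(zip(EMOTIONS, counts.tolist())),
            "distribution": dict(zip(EMOTIONS, rates.tolist())),
            "dominant": dominant}

# Toy example: all 20 annotators select "awe", half of them also select "fear"
votes = np.zeros((20, 8), dtype=int)
votes[:, EMOTIONS.index("awe")] = 1
votes[:10, EMOTIONS.index("fear")] = 1
print(aggregate_votes(votes))
```

Note that because annotators can select several emotions per image, the per-emotion rates are selection frequencies rather than a probability vector that sums to 1; how the paper handles that for distribution metrics is taken up in the referee exchange below.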

If this is right

  • MLLMs can now be measured against human vote distributions rather than single forced labels, giving a clearer picture of their multimodal emotion understanding.
  • Progress on the benchmark by models such as Qwen3-VL indicates recent advances in handling mixed visual signals, yet the remaining gap points to needed improvements in capturing intensity and multiplicity.
  • The LLM-as-a-judge technique shows inconsistent gains, implying it may not be a general solution for subjective perceptual tasks.
  • The dataset supplies a ready source of soft labels that could support training or calibration of future models on emotion distributions instead of hard single labels.
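
If the soft labels were used for training or calibration as the last point suggests, one plausible objective (a hypothetical sketch, not a method from the paper) is to fit a model's per-emotion probabilities to the annotator vote distribution with a KL term; `vote_dist` here is assumed renormalized to sum to 1 per image:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, vote_dist: torch.Tensor) -> torch.Tensor:
    """logits: (batch, 8) raw model outputs; vote_dist: (batch, 8) annotator vote
    distributions, assumed renormalized to sum to 1 per image."""
    log_probs = F.log_softmax(logits, dim=-1)
    # KL(vote_dist || model probs); equals cross-entropy with soft targets up to a constant
    return F.kl_div(log_probs, vote_dist, reduction="batchmean")

# Toy batch of two images with soft targets
logits = torch.randn(2, 8)
targets = torch.tensor([[0.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0],
                        [0.1, 0.0, 0.0, 0.0, 0.4, 0.3, 0.1, 0.1]])
print(soft_label_loss(logits, targets))
```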

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same twenty-annotator aggregation approach could be applied to other subjective image properties such as aesthetic quality or implied narrative to create more robust benchmarks.
  • Training models directly to match vote distributions rather than single labels might reduce overconfidence on ambiguous images.
  • Cultural or demographic differences in emotion perception could be quantified by repeating the annotation process with distinct annotator pools and comparing resulting distributions.

Load-bearing premise

Aggregating independent selections from twenty annotators per image produces a reliable and representative distribution of the emotions actually evoked by each image.

What would settle it

A replication study that collects fresh annotations from a new set of twenty annotators on the same images and finds statistically different emotion distributions would falsify the benchmark's claim to representativeness.
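
One way such a replication could be operationalized (our sketch, not a protocol the paper describes) is a per-image permutation test: pool the two annotator groups, reshuffle which round each annotator counts toward, and check whether the observed divergence between rounds exceeds what exchangeability would predict. The data layout and the use of Jensen-Shannon distance are assumptions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def round_distributions(votes: np.ndarray) -> np.ndarray:
    """votes: (n_images, n_annotators, 8) binary selections -> (n_images, 8) selection rates."""
    return votes.mean(axis=1) + 1e-9          # epsilon keeps every row normalizable

def mean_js(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean([jensenshannon(x, y) for x, y in zip(a, b)]))

def permutation_test(votes_a: np.ndarray, votes_b: np.ndarray, n_perm: int = 1000, seed: int = 0):
    """votes_a, votes_b: (n_images, 20, 8) selections from the original and replication rounds."""
    rng = np.random.default_rng(seed)
    observed = mean_js(round_distributions(votes_a), round_distributions(votes_b))
    pooled = np.concatenate([votes_a, votes_b], axis=1)   # (n_images, 40, 8)
    k = votes_a.shape[1]
    null = []
    for _ in range(n_perm):
        a_p = np.empty_like(votes_a)
        b_p = np.empty_like(votes_b)
        for i in range(pooled.shape[0]):                   # reshuffle annotator-to-round assignment per image
            perm = rng.permutation(pooled.shape[1])
            a_p[i], b_p[i] = pooled[i, perm[:k]], pooled[i, perm[k:]]
        null.append(mean_js(round_distributions(a_p), round_distributions(b_p)))
    p_value = float(np.mean(np.array(null) >= observed))
    return observed, p_value
```

A small p-value would indicate the two annotation rounds differ more than annotator sampling noise allows, which is the failure mode the premise above rules out.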

Figures

Figures reproduced from arXiv: 2605.14635 by Mo Fan, Ryotaro Shimizu, Takashi Wada, Takuya Furusawa, Tianwei Chen, Yuki Hirakawa.

Figure 1
Figure 1: We annotate a visual emotion analysis benchmark dataset across all candidate emotions and reveal the inaccurate labels from the original dataset. This dataset is used to evaluate MLLMs and the LLM-as-a-judge method in visual emotion analysis. view at source ↗
Figure 2
Figure 2: Examples of our dataset. We collect images from EmoSet [29] and FI [31]. Each image is annotated by 20 annotators, and each annotator can vote for any of eight Mikels' emotions [21]. The emotions in blue are the dominant emotion. view at source ↗
Figure 3
Figure 3: Dataset analysis on annotated emotion co-occurrence and the difference from the original labels. We observe that negative emotions are more likely to co-occur within a single image, and that images originally labeled as amusement and anger are frequently voted as other emotions by annotators. view at source ↗
Figure 4
Figure 4: Examples of GPT-4o and GPT-5.1 outputs. The scores in orange are the most different between these two models. We observe that GPT-5.1 suppresses some emotions when other emotions are stronger. view at source ↗
Figure 5
Figure 5: Annotation workflow. view at source ↗
Figure 6
Figure 6: Annotation interface. view at source ↗
Figure 7
Figure 7: Verification questions in the quality control process. view at source ↗
Figure 8
Figure 8: Voting similarity between sampled annotators and all 193 annotators. The green vertical line shows the number of annotators (i.e., 20) in our annotation. view at source ↗
Figure 9
Figure 9: Agreements (votes) on the dominant emotion. 99.36% of the images have at least one dominant emotion that is voted by at least 5 annotators. view at source ↗
Figure 10
Figure 10: Prompt for the straight MLLM inference. view at source ↗
Figure 11
Figure 11: Prompt for the LLM-as-a-judge. view at source ↗
Figure 12
Figure 12: Additional examples of the annotation and the MLLM predictions. The scores in blue are the labeled dominant emotions, while the scores in green are the predicted dominant emotions. The light green scores indicate that the model predicts multiple dominant emotions for one image. view at source ↗
Figure 13
Figure 13: Additional examples of the emotion distribution prediction where the outputs of GPT-4o and GPT-5.1 differ. The scores in orange are the most different among the two models and the labels. view at source ↗
Figure 14
Figure 14: Additional example of the LLM-as-a-judge verification. GPT-5's prediction on disgust is reversed due to the effect of Gemini-2.5-flash, resulting in decreased performance on emotion distribution prediction. view at source ↗
Figure 15
Figure 15: Extra example of the LLM-as-a-judge verification. view at source ↗
Figure 16
Figure 16: Extra example of the LLM-as-a-judge verification. view at source ↗
read the original abstract

This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire 20 annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains 10,344 images with 236,998 valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MultiEmo-Bench, a multi-label visual emotion dataset of 10,344 images where 20 annotators per image independently select all applicable emotions from eight categories; votes are aggregated into per-image distributions that serve as ground truth. It evaluates recent MLLMs (Qwen3-VL, GPT variants, Gemini, Claude) on dominant-emotion prediction and full distribution prediction, reports measurable progress relative to prior single-label benchmarks, notes substantial remaining headroom, and finds that LLM-as-a-judge does not consistently improve results.

Significance. If the aggregated distributions prove stable, the benchmark supplies a more representative evaluation target for subjective visual-emotion tasks than existing single-label collections and can usefully quantify both advances and limitations in current MLLMs.

major comments (2)
  1. [Dataset construction] Dataset construction section: the paper collects independent selections from 20 annotators per image and treats the resulting counts as representative ground truth, yet reports no inter-annotator agreement statistics (multi-label Fleiss' kappa, average pairwise Jaccard, or split-half correlation on the 8D vote vectors). For a subjective labeling task this omission leaves open the possibility that label noise rather than model capability drives the observed performance gaps, directly weakening the central claim of measurable progress.
  2. [Evaluation] Evaluation section: the precise aggregation rule that converts the 20 binary selections into the final distribution (e.g., normalized counts, thresholding) and the exact metric definitions for distribution prediction (KL divergence, Earth-mover distance, or other) are not stated, preventing independent verification of the reported numbers.
minor comments (1)
  1. Clarify in the abstract and methods whether the 236,998 valid votes exclude images with zero selections or other filtering steps.
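
For context, two of the agreement statistics requested in major comment 1 could be computed roughly as follows; this is a hedged sketch with an assumed per-image data layout, not code from the paper or the referee:

```python
import numpy as np
from itertools import combinations

def avg_pairwise_jaccard(selections: np.ndarray) -> float:
    """selections: (n_annotators, 8) binary matrix for a single image."""
    scores = []
    for i, j in combinations(range(selections.shape[0]), 2):
        a, b = selections[i].astype(bool), selections[j].astype(bool)
        union = (a | b).sum()
        # two empty selections are treated as perfect agreement
        scores.append((a & b).sum() / union if union else 1.0)
    return float(np.mean(scores))

def split_half_correlation(selections: np.ndarray, seed: int = 0) -> float:
    """Pearson correlation between the 8-D vote vectors of two random annotator halves."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(selections.shape[0])
    half = selections.shape[0] // 2
    v1 = selections[idx[:half]].mean(axis=0)
    v2 = selections[idx[half:]].mean(axis=0)
    return float(np.corrcoef(v1, v2)[0, 1])
```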

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of MultiEmo-Bench. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the paper collects independent selections from 20 annotators per image and treats the resulting counts as representative ground truth, yet reports no inter-annotator agreement statistics (multi-label Fleiss' kappa, average pairwise Jaccard, or split-half correlation on the 8D vote vectors). For a subjective labeling task this omission leaves open the possibility that label noise rather than model capability drives the observed performance gaps, directly weakening the central claim of measurable progress.

    Authors: We agree that inter-annotator agreement statistics are necessary to establish the stability of the aggregated distributions for this subjective task. In the revised manuscript we will add multi-label Fleiss' kappa, average pairwise Jaccard similarity, and split-half correlation computed on the 8-dimensional vote vectors. These metrics will quantify label consistency and directly support the reliability of the ground-truth distributions used to demonstrate progress over prior single-label benchmarks. revision: yes

  2. Referee: [Evaluation] Evaluation section: the precise aggregation rule that converts the 20 binary selections into the final distribution (e.g., normalized counts, thresholding) and the exact metric definitions for distribution prediction (KL divergence, Earth-mover distance, or other) are not stated, preventing independent verification of the reported numbers.

    Authors: We apologize for the omission of these implementation details. The aggregation rule is normalized vote counts (number of annotators selecting each emotion divided by 20), with no thresholding. Distribution-prediction metrics are KL divergence and Earth Mover's Distance. We will insert explicit statements of both the aggregation procedure and the metric formulas into the Evaluation section of the revised manuscript to permit full reproducibility. revision: yes
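
Taking the authors' stated rule at face value, a minimal sketch of the aggregation and the two named metrics might look like this; renormalizing the selection rates onto the probability simplex before KL, and treating the eight categories as ordered bins for the Earth Mover's Distance, are our assumptions, not details confirmed by the paper:

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

def to_distribution(vote_counts: np.ndarray, n_annotators: int = 20) -> np.ndarray:
    """Normalized vote counts (votes / 20), then projected onto the probability simplex
    (an assumption: raw multi-label selection rates need not sum to 1)."""
    rates = vote_counts / n_annotators + 1e-9
    return rates / rates.sum()

def kl_divergence(label_dist: np.ndarray, pred_dist: np.ndarray) -> float:
    return float(entropy(label_dist, pred_dist))       # KL(label || prediction)

def earth_movers(label_dist: np.ndarray, pred_dist: np.ndarray) -> float:
    # Treat the eight emotion categories as ordered positions 0..7 (a simplifying assumption)
    bins = np.arange(len(label_dist))
    return float(wasserstein_distance(bins, bins, label_dist, pred_dist))

label = to_distribution(np.array([0, 18, 2, 0, 0, 0, 9, 1]))
pred = to_distribution(np.array([1, 15, 4, 0, 0, 0, 6, 0]))
print(kl_divergence(label, pred), earth_movers(label, pred))
```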

Circularity Check

0 steps flagged

Benchmark labels constructed independently of model evaluations

full rationale

The paper collects 20 independent human annotations per image, aggregates the votes into fixed emotion distributions, and then evaluates external MLLMs (Qwen3-VL, GPT, Gemini, Claude) against those labels on dominant-emotion and distribution tasks. No equation or claim reduces a model prediction to a fitted parameter, self-definition, or self-citation chain; the ground truth is external to the evaluated models. The LLM-as-a-judge ablation is likewise a direct comparison against the same fixed labels. This is a standard benchmark-construction workflow with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a fixed set of eight emotions and vote-count aggregation capture representative emotional responses; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Images evoke emotions that can be categorized into a fixed set of eight emotions.
    The benchmark is constructed around this fixed taxonomy without further justification or derivation in the abstract.

pith-pipeline@v0.9.0 · 5626 in / 1222 out tokens · 43411 ms · 2026-05-15T05:43:20.081381+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

  1. Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., Guibas, L.J.: ArtEmis: Affective language for visual art. In: CVPR. pp. 11569–11579 (2021)
  2. Anthropic: System card: Claude Opus 4 & Claude Sonnet 4. Tech. rep., Anthropic (May 2025)
  3. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...: Qwen3-VL Technical Report
  4. Bhattacharyya, S., Wang, J.Z.: Evaluating vision-language models for emotion recognition. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) NAACL Findings. pp. 1798–1820. Association for Computational Linguistics (2025)
  5. Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: ShareGPT4V: Improving large multi-modal models with better captions. In: ECCV. vol. 15075, pp. 370–387 (2024)
  6. Cheng, Z., Cheng, Z., He, J., Wang, K., Lin, Y., Lian, Z., Peng, X., Hauptmann, A.G.: Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning. In: NeurIPS (2024)
  7. Dang, S., He, Y., Ling, L., Qian, Z., Zhao, N., Cao, N.: EmotiCrafter: Text-to-emotional-image generation based on valence-arousal model. In: ICCV. pp. 15218–15228 (October 2025)
  8. Deng, K., Ray, A., Tan, R., Gabriel, S., Plummer, B.A., Saenko, K.: Socratis: Are large multimodal models emotionally aware? arXiv e-prints abs/2308.16741 (2023)
  9. Gao, L., Jia, Z., Zeng, Y., Sun, W., Zhang, Y., Zhou, W., Zhai, G., Min, X.: EEmo-Bench: A benchmark for multi-modal large language models on image evoked emotion assessment. In: Gurrin, C., Schoeffmann, K., Zhang, M., Rossetto, L., Rudinac, S., Dang-Nguyen, D., Cheng, W., Chen, P., Benois-Pineau, J. (eds.) ACM MM. pp. 7064–7073 (2025)
  10. Google DeepMind: Gemini 3 Flash model card (Feb 2026)
  11. Guo, Y., Hong, D., Chen, W., She, Z., Ye, C., Chang, X., Mao, Z.: EmoVerse: A MLLMs-driven emotion representation dataset for interpretable visual emotion analysis. arXiv e-prints abs/2511.12554 (2025)
  12. Huang, Y., Sheng, X., Yang, Z., Yuan, Q., Duan, Z., Chen, P., Li, L., Lin, W., Shi, G.: AesExpert: Towards multi-modality foundation model for image aesthetics perception. In: ACM MM. pp. 5911–5920 (2024)
  13. Huang, Y., Yuan, Q., Sheng, X., Yang, Z., Wu, H., Chen, P., Yang, Y., Li, L., Lin, W.: AesBench: An expert benchmark for multimodal large language models on image aesthetics perception. arXiv e-prints abs/2401.08276 (2024)
  14. Lang, P.J., Bradley, M.M., Cuthbert, B.N., et al.: International Affective Picture System (IAPS): Technical manual and affective ratings. NIMH Center for the Study of Emotion and Attention 1(39–58), 3 (1997)
  15. Lian, Z., Sun, H., Sun, L., Chen, H., Chen, L., Gu, H., Wen, Z., Chen, S., Zhang, S., Yao, H., Liu, B., Liu, R., Liang, S., Li, Y., Yi, J., Tao, J.: OV-MER: Towards open-vocabulary multimodal emotion recognition. In: ICML. vol. 267 (2025)
  16. Lian, Z., Sun, H., Sun, L., Yi, J., Liu, B., Tao, J.: AffectGPT: Dataset and framework for explainable multimodal emotion recognition. arXiv e-prints abs/2407.07653 (2024)
  17. Liao, Z., Liu, X., Qin, W., Li, Q., Wang, Q., Wan, P., Zhang, D., Zeng, L., Feng, P.: HumanAesExpert: Advancing a multi-modality foundation model for human image aesthetic assessment. arXiv e-prints abs/2503.23907 (2025)
  18. Lin, Y., Sun, J., Cheng, Z., Wang, J., Liang, H., Cheng, Z., Dong, Y., He, J., Peng, X., Hua, X.: Why we feel: Breaking boundaries in emotional reasoning with multimodal large language models. In: CVPRW. pp. 5196–5206 (2025)
  19. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychology and art theory. In: Bimbo, A.D., Chang, S., Smeulders, A.W.M. (eds.) ACM MM. pp. 83–92. ACM (2010)
  20. Mertens, L., Yargholi, E., de Beeck, H.P.O., den Stock, J.V., Vennekens, J.: FindingEmo: An image dataset for emotion recognition in the wild. In: Globersons, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J.M., Zhang, C. (eds.) AAAI (2024)
  21. Mikels, J.A., Fredrickson, B.L., Larkin, G.R.S., Lindberg, C.M., Maglio, S.J., Reuter-Lorenz, P.A.: Emotional category data on images from the International Affective Picture System. Behavior Research Methods 37, 626–630 (2005)
  22. Mohamed, Y., Khan, F.F., Haydarov, K., Elhoseiny, M.: It is okay to not be okay: Overcoming emotional bias in affective image captioning by contrastive data collection. In: CVPR. pp. 21231–21240 (2022)
  23. OpenAI: GPT-4o system card. arXiv e-prints abs/2410.21276 (2024)
  24. OpenAI: GPT-5.1 Instant and GPT-5.1 Thinking system card addendum (Nov 2025)
  25. OpenAI: OpenAI GPT-5 system card. arXiv e-prints abs/2601.03267 (2026)
  26. Peng, K., Chen, T., Sadovnik, A., Gallagher, A.C.: A mixed bag of emotions: Model, predict, and transfer emotion distributions. In: CVPR. pp. 860–868 (2015)
  27. Team, G.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv e-prints abs/2507.06261 (2025)
  28. Xie, H., Peng, C., Tseng, Y., Chen, H., Hsu, C., Shuai, H., Cheng, W.: EmoVIT: Revolutionizing emotion insights with visual instruction tuning. In: CVPR. pp. 26586–26595 (2024)
  29. Yang, J., Huang, Q., Ding, T., Lischinski, D., Cohen-Or, D., Huang, H.: EmoSet: A large-scale visual emotion dataset with rich attributes. In: ICCV. pp. 20326–20337 (2023)
  30. Yang, J., Sun, M., Sun, X.: Learning visual sentiment distributions via augmented conditional probability neural network. In: Singh, S., Markovitch, S. (eds.) AAAI. pp. 224–230 (2017)
  31. You, Q., Luo, J., Jin, H., Yang, J.: Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In: Schuurmans, D., Wellman, M.P. (eds.) AAAI. pp. 308–314 (2016)
  32. Zhang, C., Xie, H., Wen, B., Zuo, S., Zhang, R., Cheng, W.: EmoArt: A multidimensional dataset for emotion-aware artistic generation. In: ACM MM. pp. 12644–12650 (2025)
  33. Zhao, S., Yao, X., Yang, J., Jia, G., Ding, G., Chua, T., Schuller, B.W., Keutzer, K.: Affective image content analysis: Two decades review and new perspectives. TPAMI 44(10), 6729–6751 (2022)
  34. Zhou, H., Tang, L., Yang, R., Qin, G., Zhang, Y., Hu, R., Li, X.: UniQA: Unified vision-language pre-training for image quality and aesthetic assessment. arXiv e-prints abs/2406.01069 (2024)