Recognition: 2 theorem links · Lean Theorem
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
Pith reviewed 2026-05-15 05:43 UTC · model grok-4.3
The pith
A multi-label benchmark with aggregated annotator votes shows recent MLLMs have advanced on visual emotion prediction but still leave substantial room for improvement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a multi-label visual emotion benchmark created by aggregating independent selections from twenty annotators per image into vote distributions yields a more representative ground truth than prior single-label schemes. Evaluations of current MLLMs on this benchmark demonstrate measurable progress in both dominant emotion prediction and emotion distribution prediction, while also revealing persistent gaps; additionally, the LLM-as-a-judge approach does not consistently improve results on this subjective task.
What carries the argument
Aggregation of twenty independent annotator selections into per-image vote distributions across eight emotion categories, used as the evaluation target for both dominant and distributional prediction.
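A minimal sketch of this aggregation step, assuming each annotator's choices are stored as an 8-dimensional binary vector; the emotion names (a Mikels-style set), the function name, and the division by annotator count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Assumed eight-category set (Mikels-style); the review does not list the paper's exact label names.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

def aggregate_votes(selections: np.ndarray) -> np.ndarray:
    """Aggregate per-annotator multi-label selections into per-emotion vote fractions.

    selections: (n_annotators, 8) binary array; row i marks the emotions annotator i selected.
    Returns the 8-dim vector of vote counts divided by the number of annotators.
    """
    counts = selections.sum(axis=0)       # votes per emotion category
    return counts / selections.shape[0]   # 20 annotators per image in the paper

# Toy example: 20 annotators, each selecting any subset of the eight emotions.
rng = np.random.default_rng(0)
selections = rng.integers(0, 2, size=(20, 8))
dist = aggregate_votes(selections)              # evaluation target for distribution prediction
dominant = EMOTIONS[int(np.argmax(dist))]       # evaluation target for dominant-emotion prediction
```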
If this is right
- MLLMs can now be measured against human vote distributions rather than single forced labels, giving a clearer picture of their multimodal emotion understanding.
- Progress on the benchmark by models such as Qwen3-VL indicates recent advances in handling mixed visual signals, yet the remaining gap shows that capturing emotional intensity and multiplicity still needs improvement.
- The LLM-as-a-judge technique shows inconsistent gains, implying it may not be a general solution for subjective perceptual tasks.
- The dataset supplies a ready source of soft labels that could support training or calibration of future models on emotion distributions instead of hard single labels.
Where Pith is reading between the lines
- The same twenty-annotator aggregation approach could be applied to other subjective image properties such as aesthetic quality or implied narrative to create more robust benchmarks.
- Training models directly to match vote distributions rather than single labels might reduce overconfidence on ambiguous images; a minimal loss sketch follows this list.
- Cultural or demographic differences in emotion perception could be quantified by repeating the annotation process with distinct annotator pools and comparing resulting distributions.
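A minimal sketch of that soft-label training idea, assuming PyTorch and a generic eight-way emotion head; `soft_label_loss`, the renormalization step, and the hypothetical `model` in the usage comment are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, vote_fractions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against annotator vote distributions instead of hard labels.

    logits:         (batch, 8) raw model scores over the eight emotions.
    vote_fractions: (batch, 8) per-image vote fractions; assumed to have at least
                    one nonzero entry per row, renormalized here to sum to 1.
    """
    target = vote_fractions / vote_fractions.sum(dim=1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()  # soft cross-entropy

# Usage with any differentiable emotion head (hypothetical `model`):
# loss = soft_label_loss(model(images), vote_fractions)
# loss.backward()
```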
Load-bearing premise
Aggregating independent selections from twenty annotators per image produces a reliable and representative distribution of the emotions actually evoked by each image.
What would settle it
A replication study that collects fresh annotations from a new set of twenty annotators on the same images and finds statistically different emotion distributions would falsify the benchmark's claim to representativeness.
Original abstract
This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MultiEmo-Bench, a multi-label visual emotion dataset of 10,344 images where 20 annotators per image independently select all applicable emotions from eight categories; votes are aggregated into per-image distributions that serve as ground truth. It evaluates recent MLLMs (Qwen3-VL, GPT variants, Gemini, Claude) on dominant-emotion prediction and full distribution prediction, reports measurable progress relative to prior single-label benchmarks, notes substantial remaining headroom, and finds that LLM-as-a-judge does not consistently improve results.
Significance. If the aggregated distributions prove stable, the benchmark supplies a more representative evaluation target for subjective visual-emotion tasks than existing single-label collections and can usefully quantify both advances and limitations in current MLLMs.
major comments (2)
- [Dataset construction] The paper collects independent selections from 20 annotators per image and treats the resulting counts as representative ground truth, yet reports no inter-annotator agreement statistics (multi-label Fleiss' kappa, average pairwise Jaccard, or split-half correlation on the 8D vote vectors). For a subjective labeling task this omission leaves open the possibility that label noise rather than model capability drives the observed performance gaps, directly weakening the central claim of measurable progress. A sketch of two of these statistics follows the major comments.
- [Evaluation] The precise aggregation rule that converts the 20 binary selections into the final distribution (e.g., normalized counts, thresholding) and the exact metric definitions for distribution prediction (KL divergence, Earth Mover's Distance, or other) are not stated, preventing independent verification of the reported numbers.
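A minimal sketch of two of the requested statistics (average pairwise Jaccard and split-half correlation on the per-image vote vectors), written against an assumed binary selection matrix; the function names and data layout are hypothetical, not taken from the paper.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_jaccard(selections: np.ndarray) -> float:
    """Average Jaccard similarity between all annotator pairs' multi-label selections.

    selections: (n_annotators, n_emotions) binary matrix for a single image.
    """
    sims = []
    for a, b in combinations(range(selections.shape[0]), 2):
        inter = np.logical_and(selections[a], selections[b]).sum()
        union = np.logical_or(selections[a], selections[b]).sum()
        sims.append(inter / union if union else 1.0)  # two empty selections count as agreement
    return float(np.mean(sims))

def split_half_correlation(selections: np.ndarray, seed: int = 0) -> float:
    """Pearson correlation between vote vectors from two random halves of the annotators.

    Returns NaN if either half has a constant vote vector; in practice average over images and seeds.
    """
    idx = np.random.default_rng(seed).permutation(selections.shape[0])
    half = len(idx) // 2
    v1 = selections[idx[:half]].sum(axis=0)
    v2 = selections[idx[half:]].sum(axis=0)
    return float(np.corrcoef(v1, v2)[0, 1])
```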
minor comments (1)
- Clarify in the abstract and methods whether the count of 236,998 valid votes reflects the exclusion of images with zero selections or any other filtering steps.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of MultiEmo-Bench. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Dataset construction] The paper collects independent selections from 20 annotators per image and treats the resulting counts as representative ground truth, yet reports no inter-annotator agreement statistics (multi-label Fleiss' kappa, average pairwise Jaccard, or split-half correlation on the 8D vote vectors). For a subjective labeling task this omission leaves open the possibility that label noise rather than model capability drives the observed performance gaps, directly weakening the central claim of measurable progress.
Authors: We agree that inter-annotator agreement statistics are necessary to establish the stability of the aggregated distributions for this subjective task. In the revised manuscript we will add multi-label Fleiss' kappa, average pairwise Jaccard similarity, and split-half correlation computed on the 8-dimensional vote vectors. These metrics will quantify label consistency and directly support the reliability of the ground-truth distributions used to demonstrate progress over prior single-label benchmarks. Revision: yes
- Referee: [Evaluation] The precise aggregation rule that converts the 20 binary selections into the final distribution (e.g., normalized counts, thresholding) and the exact metric definitions for distribution prediction (KL divergence, Earth Mover's Distance, or other) are not stated, preventing independent verification of the reported numbers.
Authors: We apologize for the omission of these implementation details. The aggregation rule is normalized vote counts (the number of annotators selecting each emotion divided by 20), with no thresholding. The distribution-prediction metrics are KL divergence and Earth Mover's Distance. We will insert explicit statements of both the aggregation procedure and the metric formulas into the Evaluation section of the revised manuscript to permit full reproducibility. Revision: yes
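To make the stated metrics concrete, here is a minimal sketch of KL divergence and a cumulative-sum Earth Mover's Distance over the eight categories, assuming both prediction and target are renormalized to sum to 1; the smoothing constant and the fixed category ordering used by the 1-D EMD are assumptions rather than details given in the paper.

```python
import numpy as np

def kl_divergence(target: np.ndarray, pred: np.ndarray, eps: float = 1e-8) -> float:
    """KL(target || pred) between two emotion distributions, smoothed to avoid log(0)."""
    t = (target + eps) / (target + eps).sum()
    p = (pred + eps) / (pred + eps).sum()
    return float(np.sum(t * np.log(t / p)))

def emd_1d(target: np.ndarray, pred: np.ndarray) -> float:
    """Earth Mover's Distance under a fixed category ordering:
    sum of absolute differences between the two cumulative distributions."""
    t = target / target.sum()
    p = pred / pred.sum()
    return float(np.abs(np.cumsum(t) - np.cumsum(p)).sum())

# Example: 20 aggregated votes as fractions vs. a model's predicted distribution.
target = np.array([6, 10, 2, 1, 0, 0, 1, 0], dtype=float) / 20
pred = np.array([0.30, 0.40, 0.10, 0.05, 0.05, 0.02, 0.05, 0.03])
print(kl_divergence(target, pred), emd_1d(target, pred))
```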
Circularity Check
Benchmark labels constructed independently of model evaluations
Full rationale
The paper collects 20 independent human annotations per image, aggregates the votes into fixed emotion distributions, and then evaluates external MLLMs (Qwen3-VL, GPT, Gemini, Claude) against those labels on dominant-emotion and distribution tasks. No equation or claim reduces a model prediction to a fitted parameter, self-definition, or self-citation chain; the ground truth is external to the evaluated models. The LLM-as-a-judge ablation is likewise a direct comparison against the same fixed labels. This is a standard benchmark-construction workflow with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: images evoke emotions that can be categorized into a fixed set of eight emotions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We hire 20 annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.