arxiv: 2511.12554 · v2 · submitted 2025-11-16 · 💻 cs.CV

EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

Pith reviewed 2026-05-17 21:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual emotion analysis · emotion dataset · B-A-S triplets · interpretable annotations · MLLM pipeline · categorical emotion states · dimensional emotion space · grounded visual reasoning

The pith

EmoVerse dataset decomposes emotions in images into Background-Attribute-Subject triplets grounded to visual regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EmoVerse as a large dataset of over 219k images that breaks visual emotions down into specific Background-Attribute-Subject triplets, each tied to particular parts of the image. This approach moves past single overall emotion labels by adding word-level and subject-level reasoning, while supplying both categorical emotion states and dimensional emotion space annotations for the same content. A multi-stage pipeline built around multi-modal large language models handles most of the annotation work with only light human checks. If the claims hold, researchers gain a resource that supports models capable of explaining which visual elements drive a given emotional response instead of just outputting a label.

Core claim

EmoVerse supplies a knowledge-graph-inspired annotation scheme that decomposes emotions into Background-Attribute-Subject triplets grounded to image regions, paired with dual Categorical Emotion States (CES) and Dimensional Emotion Space (DES) labels across more than 219k images. All annotations are produced through a multi-stage MLLM pipeline that keeps human effort low. The paper claims that this structure directly enables interpretable visual emotion analysis, together with a model that maps visual cues to dimensional representations and provides attribution explanations.

What carries the argument

The Background-Attribute-Subject (B-A-S) triplet decomposition that grounds each emotional element to specific visual regions in the image.
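
As a rough illustration of what one grounded B-A-S annotation might look like as a record, here is a minimal sketch. The class names, field names, bounding-box format, two-dimensional DES vector, and all example values are hypothetical assumptions made for illustration; they are not the released EmoVerse file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema: every name and format below is an illustrative
# assumption, not the actual EmoVerse annotation format.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class GroundedElement:
    text: str     # word or phrase describing the element
    region: BBox  # image region the phrase is grounded to


@dataclass
class BASTriplet:
    background: GroundedElement
    attribute: GroundedElement
    subject: GroundedElement


@dataclass
class EmotionAnnotation:
    image_id: str
    triplets: List[BASTriplet]
    ces_label: str            # categorical emotion state (discrete label)
    des_vector: List[float]   # point in the dimensional emotion space

    def subjects(self) -> List[str]:
        """Return the subject phrase of every triplet on this image."""
        return [t.subject.text for t in self.triplets]


# Toy record with made-up values, only to show how the pieces fit together.
example = EmotionAnnotation(
    image_id="img_000001",
    triplets=[
        BASTriplet(
            background=GroundedElement("beach", (0.0, 120.0, 640.0, 480.0)),
            attribute=GroundedElement("laughing", (210.0, 90.0, 330.0, 200.0)),
            subject=GroundedElement("child", (200.0, 80.0, 360.0, 420.0)),
        )
    ],
    ces_label="amusement",
    des_vector=[0.8, 0.6],
)
print(example.subjects())
```

Keeping a region on every element, rather than one box per image, is what would let a downstream model point at the part of the image behind each word of its explanation.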

If this is right

  • The dataset directly supports word-level and subject-level emotional reasoning by linking each triplet element to image regions.
  • Dual CES and DES annotations allow the same images to serve both discrete category and continuous dimension analyses.
  • The accompanying model maps visual cues into dimensional emotion representations while supplying attribution explanations for each prediction.
  • The pipeline reduces the scale of manual annotation needed while maintaining the structure required for interpretable outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The triplet grounding could be applied to video sequences to track how emotional elements change over time.
  • Content recommendation systems might use the region-specific labels to filter or prioritize images based on targeted emotional triggers.
  • Cross-cultural tests on the dataset could reveal whether the language-model-generated annotations carry biases in emotional interpretation.

Load-bearing premise

The multi-stage pipeline assumes that outputs from multi-modal large language models contain few systematic biases or inconsistencies and therefore require only light human verification to reach high reliability.

What would settle it

A controlled check in which independent human annotators compare the provided B-A-S triplet labels against the actual image content on a random sample and report low agreement rates would show the annotations do not reliably capture the intended elements.
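
The check described above can be sketched as follows: draw a reproducible random sample of image IDs, record one human verdict per sampled image, and report the agreement rate with a 95% Wilson score interval. The function names and the one-boolean-per-image verdict format are assumptions for illustration; the paper does not specify such a protocol.

```python
import math
import random
from typing import Dict, Iterable, List, Tuple


def sample_ids(image_ids: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct image IDs uniformly at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(list(image_ids), k)


def agreement_rate(judgments: Dict[str, bool]) -> Tuple[float, float, float]:
    """Return (rate, low, high): share of sampled images whose labels the
    human judged correct, with a 95% Wilson score interval."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments supplied")
    p = sum(judgments.values()) / n
    z = 1.96  # normal quantile for a two-sided 95% interval
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half


# Toy data only: 100 sampled images, 80 judged correct by the annotator.
ids = sample_ids((f"img_{i:06d}" for i in range(1000)), k=100)
toy_judgments = {image_id: (i < 80) for i, image_id in enumerate(ids)}
rate, low, high = agreement_rate(toy_judgments)
print(f"agreement {rate:.2f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The interval matters as much as the point estimate: a low rate on a small sample is weak evidence, whereas a low upper bound on a large sample would be hard to dismiss.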

Figures

Figures reproduced from arXiv: 2511.12554 by Cheng Ye, Dexiang Hong, Weidong Chen, Xiaojun Chang, Yijie Guo, Zhendong Mao, Zihan She.

Figure 1: EmoVerse Dataset introduces the first large-scale visual emotion dataset that combines Categorical Emotion States (CES) and …
Figure 2: Overview of EmoVerse. EmoVerse collects images from multiple sources. Images collected pass through Annotation and …
Figure 3: Architecture of our Interpretable Model. Model fine …
Figure 4: Knowledge graph based on B-A-S triplets. Decoupled …
Figure 5: Emotion category distribution statistics. Colored segments show the percentage of each category.
Figure 6: Visualization of emotion cloud use DES embeddings, projected through MDS. DES trained with full attribution exhibits the most …
Figure 7: Visualization of comparison. The model fine-tuned by …

Abstract

Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces EmoVerse, a large-scale open-source dataset of over 219k images for interpretable visual emotion analysis. Emotions are decomposed into Background-Attribute-Subject (B-A-S) triplets with visual region grounding to enable word-level and subject-level reasoning; dual annotations are provided in both Categorical Emotion States (CES) and Dimensional Emotion Space (DES). A novel multi-stage MLLM pipeline is used to generate the annotations with claimed high reliability and minimal human effort, and an interpretable model is presented that maps visual cues to DES representations with attribution explanations.

Significance. If the reliability of the B-A-S annotations and dual CES/DES labels is demonstrated, the dataset would provide a valuable resource for advancing explainable visual emotion analysis beyond single discrete labels, supporting more detailed affective reasoning and unified discrete-continuous representations in computer vision.

major comments (1)
  1. [Abstract and Methods/Pipeline section] The central claim that the multi-stage MLLM pipeline produces trustworthy annotations 'with high reliability and minimal human effort' (Abstract and pipeline description) lacks any supporting quantitative evidence. No inter-annotator agreement metrics (e.g., Fleiss' kappa), human-expert agreement on a sampled subset, error analysis, or bias audits for affective decomposition or visual grounding are reported. This directly undermines the trustworthiness of the 219k B-A-S triplets and dual labels, which is load-bearing for the dataset's utility.
minor comments (2)
  1. [Abstract] Clarify the exact number of images, triplets, and annotation coverage statistics in the abstract or early introduction for immediate context.
  2. [Introduction and Dataset sections] Ensure consistent terminology when referring to 'subject-level' versus 'word-level' emotional reasoning across sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the need for quantitative evidence supporting the reliability of the annotations below.

  1. Referee: [Abstract and Methods/Pipeline section] The central claim that the multi-stage MLLM pipeline produces trustworthy annotations 'with high reliability and minimal human effort' (Abstract and pipeline description) lacks any supporting quantitative evidence. No inter-annotator agreement metrics (e.g., Fleiss' kappa), human-expert agreement on a sampled subset, error analysis, or bias audits for affective decomposition or visual grounding are reported. This directly undermines the trustworthiness of the 219k B-A-S triplets and dual labels, which is load-bearing for the dataset's utility.

    Authors: We agree that the manuscript would be strengthened by including explicit quantitative validation of the annotation quality. In the revised version, we will add a new subsection under Methods detailing a human evaluation protocol. This will report inter-annotator agreement (Fleiss' kappa for CES labels and Pearson/Spearman correlations for DES dimensions) between the MLLM outputs and multiple human experts on a stratified sample of at least 1,000 images. We will also include a systematic error analysis of B-A-S triplet decomposition and region grounding accuracy, along with a bias audit examining potential demographic or cultural skews in the generated labels. These additions will provide direct empirical support for the pipeline's reliability claims while preserving the minimal-human-effort design.

    revision: yes
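
To make the proposed agreement metrics concrete, here is a minimal pure-Python sketch of Fleiss' kappa (for categorical CES labels) and the Pearson correlation (for a single DES dimension). Treating the MLLM as one more rater alongside the human experts is an assumption made here for illustration; the rebuttal does not say how the comparison would be set up, and the toy numbers below are not results from the paper.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa. counts[i][j] is the number of raters who assigned
    item i to category j; every item must have the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed pairwise agreement on each item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0  # degenerate case: every rating is in a single category
    return (p_obs - p_exp) / (1 - p_exp)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy CES example: 4 images, 3 raters, 2 categories, perfect agreement.
toy_counts = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(toy_counts))

# Toy DES example: model scores vs. mean human scores on one dimension.
print(pearson([0.1, 0.4, 0.7], [0.2, 0.5, 0.8]))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few emotion categories dominate the label distribution.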

Circularity Check

0 steps flagged

Dataset release exhibits no circularity in claimed derivations

The paper introduces EmoVerse as a dataset constructed via a multi-stage MLLM pipeline for B-A-S triplet annotations and dual CES/DES labels across 219k images. No mathematical derivations, predictions, fitted parameters, or uniqueness theorems are presented that reduce to the inputs by construction. The central value lies in the external usability of the released dataset and its annotations rather than any internal self-referential logic, self-citation chains, or ansatzes. Claims of high annotation reliability with minimal human effort are methodological assertions without evidence of reduction to prior author work or statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are invoked; the contribution is an empirical dataset and annotation pipeline rather than a theoretical model.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

    cs.CV 2026-05 conditional novelty 7.0

    MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.
