EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

arxiv: 2511.12554 · v2 · submitted 2025-11-16 · 💻 cs.CV

Yijie Guo, Dexiang Hong, Weidong Chen, Zihan She, Cheng Ye, Xiaojun Chang, Zhendong Mao

Pith reviewed 2026-05-17 21:46 UTC · model grok-4.3

keywords: visual emotion analysis · emotion dataset · B-A-S triplets · interpretable annotations · MLLM pipeline · categorical emotion states · dimensional emotion space · grounded visual reasoning

The pith

The EmoVerse dataset decomposes emotions in images into Background-Attribute-Subject triplets, each grounded to a visual region.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EmoVerse as a large dataset of over 219k images that breaks visual emotions down into specific Background-Attribute-Subject triplets, each tied to particular parts of the image. This approach moves past single overall emotion labels by adding word-level and subject-level reasoning, while supplying both categorical emotion states and dimensional emotion space annotations for the same content. A multi-stage pipeline built around multi-modal large language models handles most of the annotation work with only light human checks. If the claims hold, researchers gain a resource that supports models capable of explaining which visual elements drive a given emotional response instead of just outputting a label.

Core claim

EmoVerse supplies a knowledge-graph-inspired annotation scheme that decomposes emotions into Background-Attribute-Subject (B-A-S) triplets grounded to image regions, paired with dual Categorical Emotion States (CES) and Dimensional Emotion Space (DES) labels across more than 219k images. The annotations are produced through a multi-stage MLLM pipeline that keeps human effort low. According to the paper, this structure enables interpretable visual emotion analysis, together with a model that maps visual cues to dimensional representations and provides attribution explanations.

What carries the argument

The Background-Attribute-Subject (B-A-S) triplet decomposition that grounds each emotional element to specific visual regions in the image.
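To make the triplet idea concrete, one record of such an annotation could be laid out as in the sketch below. This is an illustration only: the class names, field names, bounding-box format, DES dimensionality, and all example values are assumptions made here, not the schema the dataset actually ships with.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed region format: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class GroundedElement:
    """One element of a triplet: a word or phrase plus the region it refers to."""
    text: str
    region: BBox


@dataclass
class BASTriplet:
    """A Background-Attribute-Subject triplet, each part tied to a region."""
    background: GroundedElement
    attribute: GroundedElement
    subject: GroundedElement


@dataclass
class EmotionAnnotation:
    """Hypothetical per-image record carrying both label types."""
    image_id: str
    ces_label: str                 # categorical emotion state (discrete)
    des_vector: Tuple[float, ...]  # dimensional emotion space (continuous); size assumed
    triplets: List[BASTriplet] = field(default_factory=list)

    def regions_for(self, role: str) -> List[BBox]:
        """Return the regions of every triplet element playing the given role."""
        if role not in ("background", "attribute", "subject"):
            raise ValueError(f"unknown role: {role}")
        return [getattr(t, role).region for t in self.triplets]


# Placeholder values, chosen only to show the shape of a record.
example = EmotionAnnotation(
    image_id="img_000001",
    ces_label="contentment",
    des_vector=(0.6, 0.2),
    triplets=[
        BASTriplet(
            background=GroundedElement("beach", (0, 0, 640, 480)),
            attribute=GroundedElement("smiling", (210, 90, 300, 170)),
            subject=GroundedElement("child", (180, 60, 340, 420)),
        )
    ],
)
```

Keeping the region on each element, rather than on the triplet as a whole, is what would let a model point at the specific part of the image behind each word.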

If this is right

The dataset directly supports word-level and subject-level emotional reasoning by linking each triplet element to image regions.
Dual CES and DES annotations allow the same images to serve both discrete category and continuous dimension analyses.
The accompanying model maps visual cues into dimensional emotion representations while supplying attribution explanations for each prediction.
The pipeline reduces the scale of manual annotation needed while maintaining the structure required for interpretable outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The triplet grounding could be applied to video sequences to track how emotional elements change over time.
Content recommendation systems might use the region-specific labels to filter or prioritize images based on targeted emotional triggers.
Cross-cultural tests on the dataset could reveal whether the language-model-generated annotations carry biases in emotional interpretation.

Load-bearing premise

The multi-stage pipeline assumes that outputs from multi-modal large language models contain few systematic biases or inconsistencies and therefore require only light human verification to reach high reliability.

What would settle it

A controlled check on a random sample, in which independent human annotators compare the provided B-A-S triplet labels against the actual image content. Low agreement rates would show that the annotations do not reliably capture the intended elements.
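A check of this kind could be scored as in the sketch below: draw a reproducible random sample, record whether annotators judged each sampled label correct, and report the agreement rate with a Wilson score interval. The function names and the boolean judgment format are assumptions made for illustration; the paper does not describe such a procedure.

```python
import math
import random
from typing import Dict, List, Tuple


def sample_ids(image_ids: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct image ids with a fixed seed so the sample is reproducible."""
    rng = random.Random(seed)
    return rng.sample(image_ids, k)


def agreement_rate(judgments: Dict[str, bool]) -> float:
    """Fraction of sampled labels that the human annotators judged correct."""
    if not judgments:
        raise ValueError("no judgments supplied")
    return sum(judgments.values()) / len(judgments)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

The interval matters here because a single agreement percentage on a small sample says little about a dataset of more than 219k images.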

Figures

Figures reproduced from arXiv: 2511.12554 by Cheng Ye, Dexiang Hong, Weidong Chen, Xiaojun Chang, Yijie Guo, Zhendong Mao, Zihan She.

**Figure 1.** EmoVerse Dataset introduces the first large-scale visual emotion dataset that combines Categorical Emotion States (CES) and …

**Figure 2.** Overview of EmoVerse. EmoVerse collects images from multiple sources. Images collected pass through Annotation and …

**Figure 3.** Architecture of our Interpretable Model. Model fine …

**Figure 4.** Knowledge graph based on B-A-S triplets. Decoupled …

**Figure 5.** Emotion category distribution statistics. Colored segments show the percentage of each category.

**Figure 6.** Visualization of emotion cloud using DES embeddings, projected through MDS. DES trained with full attribution exhibits the most …

**Figure 7.** Visualization of comparison. The model fine-tuned by …

Original abstract

Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EmoVerse releases a dataset with region-grounded B-A-S triplets and dual CES/DES labels, but its reliability claims rest on unshown validation.

The key takeaway is that EmoVerse is a new large-scale dataset for visual emotion analysis featuring region-grounded B-A-S triplets and dual categorical and dimensional annotations. This structuring choice stands out from previous single-label approaches.

The paper does well in motivating the need for more interpretable annotations and in scaling up the data collection using MLLMs with a multi-stage pipeline. The inclusion of both CES and DES labels allows for flexible use in different modeling contexts, and the attempt to provide attribution explanations through the introduced model adds to its potential utility.

Soft spots center on the reliability claims. The pipeline is said to ensure high annotation quality with minimal human effort, yet no quantitative metrics like agreement scores or error rates are provided. This is a noticeable gap because emotion judgments are inherently variable, and without evidence of low bias or high consistency, the dataset's trustworthiness for downstream tasks remains to be proven. It's not a deal-breaker for a dataset paper, but it weakens the central selling point. The work appears to build reasonably on existing literature without major citation gaps.

This is for researchers in visual emotion analysis and explainable affective AI. Readers interested in fine-grained emotion datasets would find the scale and structure valuable, even if they plan to add their own checks.

I recommend sending it for peer review. The dataset has enough novelty and potential impact to warrant referee input, particularly on the validation aspects.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces EmoVerse, a large-scale open-source dataset of over 219k images for interpretable visual emotion analysis. Emotions are decomposed into Background-Attribute-Subject (B-A-S) triplets with visual region grounding to enable word-level and subject-level reasoning; dual annotations are provided in both Categorical Emotion States (CES) and Dimensional Emotion Space (DES). A novel multi-stage MLLM pipeline is used to generate the annotations with claimed high reliability and minimal human effort, and an interpretable model is presented that maps visual cues to DES representations with attribution explanations.

Significance. If the reliability of the B-A-S annotations and dual CES/DES labels is demonstrated, the dataset would provide a valuable resource for advancing explainable visual emotion analysis beyond single discrete labels, supporting more detailed affective reasoning and unified discrete-continuous representations in computer vision.

major comments (1)

[Abstract and Methods/Pipeline section] The central claim that the multi-stage MLLM pipeline ensures 'high annotation reliability with minimal human effort' (Abstract and pipeline description) lacks any supporting quantitative evidence. No inter-annotator agreement metrics (e.g., Fleiss' kappa), human-expert agreement on a sampled subset, error analysis, or bias audits for affective decomposition or visual grounding are reported. This directly undermines the trustworthiness of the 219k B-A-S triplets and dual labels, which is load-bearing for the dataset's utility.
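One of the metrics the referee names, Fleiss' kappa, can be computed from a table of per-item category counts. The sketch below is a plain-Python illustration of the standard formula; the function name and input layout are assumptions made here, and nothing in it comes from the paper's pipeline.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for categorical ratings.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have been rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("no items supplied")
    n_items = len(counts)
    n_cats = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    for row in counts:
        if len(row) != n_cats or sum(row) != n_raters:
            raise ValueError("every item needs the same categories and rater count")

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_e) / (1 - p_e)
```

For emotion categories, each row would hold how many annotators chose each category for one image; values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance.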

minor comments (2)

[Abstract] Clarify the exact number of images, triplets, and annotation coverage statistics in the abstract or early introduction for immediate context.
[Introduction and Dataset sections] Ensure consistent terminology when referring to 'subject-level' versus 'word-level' emotional reasoning across sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the need for quantitative evidence supporting the reliability of the annotations below.

Referee: [Abstract and Methods/Pipeline section] The central claim that the multi-stage MLLM pipeline ensures 'high annotation reliability with minimal human effort' (Abstract and pipeline description) lacks any supporting quantitative evidence. No inter-annotator agreement metrics (e.g., Fleiss' kappa), human-expert agreement on a sampled subset, error analysis, or bias audits for affective decomposition or visual grounding are reported. This directly undermines the trustworthiness of the 219k B-A-S triplets and dual labels, which is load-bearing for the dataset's utility.

Authors: We agree that the manuscript would be strengthened by including explicit quantitative validation of the annotation quality. In the revised version, we will add a new subsection under Methods detailing a human evaluation protocol. This will report inter-annotator agreement (Fleiss' kappa for CES labels and Pearson/Spearman correlations for DES dimensions) between the MLLM outputs and multiple human experts on a stratified sample of at least 1,000 images. We will also include a systematic error analysis of B-A-S triplet decomposition and region grounding accuracy, along with a bias audit examining potential demographic or cultural skews in the generated labels. These additions will provide direct empirical support for the pipeline's reliability claims while preserving the minimal-human-effort design.

Revision: yes
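Two pieces of the promised protocol lend themselves to a short sketch: drawing the stratified sample and correlating MLLM scores with human scores on one DES dimension. The code below is illustrative only. Stratifying by CES category with proportional allocation is an assumption (the rebuttal does not say what the strata are), and the function names are hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(labels: Dict[str, str], total: int, seed: int = 0) -> List[str]:
    """Draw approximately `total` image ids, allocated proportionally per category.

    labels maps image id -> category (assumed here to be the CES label).
    Each category contributes at least one image, so the result may differ
    slightly from `total` after rounding.
    """
    rng = random.Random(seed)
    by_cat: Dict[str, List[str]] = defaultdict(list)
    for image_id, cat in sorted(labels.items()):
        by_cat[cat].append(image_id)

    picked: List[str] = []
    for cat in sorted(by_cat):
        ids = by_cat[cat]
        k = min(len(ids), max(1, round(total * len(ids) / len(labels))))
        picked.extend(rng.sample(ids, k))
    return picked


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

In use, `pearson` would be called once per DES dimension, with the MLLM scores in one list and the mean human scores for the same sampled images in the other.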

Circularity Check

0 steps flagged

Dataset release exhibits no circularity in claimed derivations

The paper introduces EmoVerse as a dataset constructed via a multi-stage MLLM pipeline for B-A-S triplet annotations and dual CES/DES labels across 219k images. No mathematical derivations, predictions, fitted parameters, or uniqueness theorems are presented that reduce to the inputs by construction. The central value lies in the external usability of the released dataset and its annotations rather than any internal self-referential logic, self-citation chains, or ansatzes. Claims of high annotation reliability with minimal human effort are methodological assertions without evidence of reduction to prior author work or statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are invoked; the contribution is an empirical dataset and annotation pipeline rather than a theoretical model.

pith-pipeline@v0.9.0 · 5522 in / 1106 out tokens · 42431 ms · 2026-05-17T21:46:13.962215+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "A novel multi-stage pipeline ensures high annotation reliability with minimal human effort... B-A-S triplets and grounding each element to visual regions"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "EmoVerse provides word-level and subject-level emotional reasoning by decomposing emotions into Background-Attribute-Subject (B-A-S) triplets"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
cs.CV 2026-05 conditional novelty 7.0

MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.


Yijie Zhu, Yibo Lyu, Zitong Yu, Rui Shao, Kaiyang Zhou, and Liqiang Nie. Emosym: A symbiotic framework for uni- fied emotional understanding and generation via latent rea- soning. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 5451–5460, 2025. 3

work page 2025

[1] Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, and Leonidas J Guibas. Artemis: Affective language for visual art. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11569–11579, 2021.

[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[3] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European conference on computer vision, pages 549–565. Springer, 2016.

[4] Sree Bhattacharyya and James Z Wang. Evaluating vision-language models for emotion recognition. arXiv preprint arXiv:2502.05660, 2025.

[5] Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, et al. Emova: Empowering language models to see, hear and speak with vivid emotions. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5455–5466, 2025.

[6] Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, and Qingming Huang. Emodubber: Towards high quality and emotion controllable movie dubbing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15863–15873, 2025.

[7] Kevin Crowston. Amazon mechanical turk: A research tool for organizations and information systems scholars. In Shaping the Future of ICT Research. Methods and Approaches: IFIP WG 8.2, Working Conference, Tampa, FL, USA, December 13-14, 2012. Proceedings, pages 210–221. Springer, 2012.

[8] Shengqi Dang, Yi He, Long Ling, Ziqing Qian, Nanxuan Zhao, and Nan Cao. Emoticrafter: Text-to-emotional-image generation based on valence-arousal model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15218–15228, 2025.

[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[10] Yiyang Fang, Wenke Huang, Guancheng Wan, Kehua Su, and Mang Ye. Emoe: Modality-specific enhanced dynamic emotion experts. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14314–14324, 2025.

[11] Riccardo Gervasi, Federico Barravecchia, Luca Mastrogiacomo, and Fiorenzo Franceschini. Applications of affective computing in human-robot interaction: State-of-art and challenges for manufacturing. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 237(6-7):815–832, 2023.

[12] Guanyu Hu, Dimitrios Kollias, and Xinyu Yang. Grounding emotion recognition with visual prototypes: Vega–revisiting clip in merc. arXiv preprint arXiv:2508.06564, 2025.

[13] Jiayun Hu, Yueyi He, Tianyi Liang, Changbo Wang, and Chenhui Li. Music2palette: Emotion-aligned color palette generation via cross-modal representation learning. arXiv preprint arXiv:2507.04758, 2025.

[14] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

[15] Marie Katsurai and Shin’ichi Satoh. Image sentiment analysis using latent correlations among visual, textual, and sentiment views. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2837–2841. IEEE, 2016.

[16] Min-jung Kim, Minsang Kim, and Seung Jun Baek. Contextface: Generating facial expressions from emotional contexts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11383–11392, 2025.

[17] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[18] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017.

[19] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.

[20] Qing Lin, Jingfeng Zhang, Yew-Soon Ong, and Mengmi Zhang. Make me happier: Evoking emotions through image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16367–16376,

[21] Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela SM Lau, Guosheng Yin, and Xiu Li. Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26242–26252, 2025.

[22] Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, and Hujun Bao. Moee: Mixture of emotion experts for audio-driven portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26222–26231, 2025.

[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,

[24] Laurent Mertens, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. Findingemo: An image dataset for emotion recognition in the wild. Advances in Neural Information Processing Systems, 37:4956–4996,

[25] Joseph A Mikels, Barbara L Fredrickson, Gregory R Larkin, Casey M Lindberg, Sam J Maglio, and Patricia A Reuter-Lorenz. Emotional category data on images from the international affective picture system. Behavior research methods, 37(4):626–630, 2005.

[26] Paula M Niedenthal and François Ric. Psychology of emotion. Psychology Press, 2017.

[27] Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, and Andrew C Gallagher. A mixed bag of emotions: Model, predict, and transfer emotion distributions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 860–868, 2015.

[28] Sreeja PS and G Mahalakshmi. Emotion models: a review. International Journal of Control Theory and Applications, 10(8):651–657, 2017.

[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[30] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.

[31] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of machine learning research, 11(4), 2010.

[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

[33] Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. Labelme: a database and web-based tool for image annotation. International journal of computer vision, 77(1):157–173, 2008.

[34] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.

[35] Xuli Shen, Hua Cai, Weilin Shen, Qing Xu, Dingding Yu, Weifeng Ge, and Xiangyang Xue. Cocoer: Aligning multi-level feature by competition and coordination for emotion recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29591–29600, 2025.

[36] Rion Snow, Brendan O’Connor, Dan Jurafsky, and Andrew Y Ng. Cheap and fast–but is it good? evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 conference on empirical methods in natural language processing, pages 254–263, 2008.

[37] Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, et al. Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26212–26221, 2025.

[38] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[39] Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 25–32. IEEE, 2010.

[40] Dan Wu, Xincheng Ju, Dong Zhang, Shoushan Li, Erik Cambria, and Guodong Zhou. Emotion across modalities and cultures: Multilingual multimodal emotion-cause analysis with memory-inspired framework. In Proceedings of ACM MM,

[41] Wuyou Xia, Guoli Jia, Sicheng Zhao, and Jufeng Yang. Seek common ground while reserving differences: Semi-supervised image-text sentiment recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29601–29611, 2025.

[42] Hongxia Xie, Chu-Jun Peng, Yu-Wen Tseng, Hung-Jen Chen, Chan-Feng Hsu, Hong-Han Shuai, and Wen-Huang Cheng. Emovit: Revolutionizing emotion insights with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26596–26605, 2024.

[43] Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Danny Cohen-Or, and Hui Huang. Emoset: A large-scale visual emotion dataset with rich attributes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20383–20394, 2023.

[44] Jingyuan Yang, Jiawei Feng, and Hui Huang. Emogen: Emotional image content generation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6358–6368, 2024.

[45] Jingyuan Yang, Jiawei Feng, Weibin Luo, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Emoedit: Evoking emotions through image manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24690–24699, 2025.

[46] Qu Yang, Qinghongya Shi, Tongxin Wang, and Mang Ye. Uncertain multimodal intention and emotion understanding in the wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24700–24709, 2025.

[47] Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, and Tao He. Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3888–3898, 2025.

[48] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In Proceedings of the AAAI conference on artificial intelligence, 2016.

[49] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the association for computational linguistics, 2:67–78, 2014.

[50] Cheng Zhang, Bin Wen, Songhan Zuo, Ruoxuan Zhang, Wen-huang Cheng, et al. Emoart: A multidimensional dataset for emotion-aware artistic generation. arXiv preprint arXiv:2506.03652, 2025.

[51] Haofan Zhang and Shangfei Wang. Emit: Emotional interaction control in text-to-image diffusion models. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9950–9958, 2025.

[52] Zixing Zhang, Zhongren Dong, Zhiqiang Gao, Shihao Gao, Donghao Wang, Ciqiang Chen, Yuhan Nie, and Huan Zhao. Open vocabulary emotion prediction based on large multimodal models. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, pages 99–103, 2024.

[53] Sicheng Zhao, Hongxun Yao, Yue Gao, Guiguang Ding, and Tat-Seng Chua. Predicting personalized image emotion perceptions in social networks. IEEE transactions on affective computing, 9(4):526–540, 2016.

[54] Yijie Zhu, Yibo Lyu, Zitong Yu, Rui Shao, Kaiyang Zhou, and Liqiang Nie. Emosym: A symbiotic framework for unified emotional understanding and generation via latent reasoning. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5451–5460, 2025.