EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis
Pith reviewed 2026-05-17 21:46 UTC · model grok-4.3
The pith
The EmoVerse dataset decomposes the emotions in an image into Background-Attribute-Subject triplets, each grounded to a visual region.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmoVerse supplies a knowledge-graph-inspired annotation scheme that decomposes emotions into Background-Attribute-Subject (B-A-S) triplets grounded to image regions. Each of its more than 219k images also carries dual labels in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), all produced by a multi-stage MLLM pipeline that keeps human effort low. The paper claims that this structure directly enables interpretable visual emotion analysis, and it pairs the dataset with a model that maps visual cues to dimensional representations while providing attribution explanations.
What carries the argument
The Background-Attribute-Subject (B-A-S) triplet decomposition that grounds each emotional element to specific visual regions in the image.
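To make the triplet idea concrete, here is a minimal sketch of how one grounded B-A-S annotation might be held in memory. The class names, field names, box format, and example values are all hypothetical and chosen for illustration; the released EmoVerse annotation format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema: names and box format are illustrative assumptions,
# not the released EmoVerse annotation format.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class GroundedElement:
    text: str  # word or phrase for this element, e.g. "dog"
    box: BBox  # image region the phrase is grounded to


@dataclass
class BASTriplet:
    background: GroundedElement
    attribute: GroundedElement
    subject: GroundedElement


@dataclass
class ImageAnnotation:
    image_id: str
    triplets: List[BASTriplet] = field(default_factory=list)

    def regions_for(self, word: str) -> List[BBox]:
        """Return every grounded box whose phrase equals `word` (case-insensitive)."""
        hits = []
        for triplet in self.triplets:
            for element in (triplet.background, triplet.attribute, triplet.subject):
                if element.text.lower() == word.lower():
                    hits.append(element.box)
        return hits


# Made-up example record, not taken from the dataset.
ann = ImageAnnotation(
    image_id="example_0001",
    triplets=[
        BASTriplet(
            background=GroundedElement("rainy street", (0.0, 0.0, 640.0, 480.0)),
            attribute=GroundedElement("abandoned", (200.0, 220.0, 420.0, 460.0)),
            subject=GroundedElement("dog", (210.0, 240.0, 400.0, 450.0)),
        )
    ],
)
print(ann.regions_for("dog"))
```

A lookup such as `regions_for` is the kind of word-level query that region grounding makes possible: given a phrase from a triplet, it returns the image area that phrase refers to.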
If this is right
- The dataset directly supports word-level and subject-level emotional reasoning by linking each triplet element to image regions.
- Dual CES and DES annotations allow the same images to serve both discrete category and continuous dimension analyses.
- The accompanying model maps visual cues into dimensional emotion representations while supplying attribution explanations for each prediction.
- The pipeline reduces the scale of manual annotation needed while maintaining the structure required for interpretable outputs.
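The second point above, that the same images can serve both discrete and continuous analyses, can be sketched as follows. The category names and the choice of two DES axes (valence and arousal) are placeholders assumed for this example; the summary does not specify the dataset's label set or its dimensions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumptions for illustration only: the category names and the two
# DES axes (valence, arousal) are placeholders, not the dataset's schema.
CES_CLASSES = ["amusement", "fear", "sadness"]


@dataclass(frozen=True)
class DualLabel:
    image_id: str
    ces: str                  # one discrete category
    des: Tuple[float, float]  # assumed (valence, arousal) coordinates


def classification_targets(labels: Sequence[DualLabel]) -> List[int]:
    """Discrete view: map each CES label to its class index."""
    return [CES_CLASSES.index(lab.ces) for lab in labels]


def regression_targets(labels: Sequence[DualLabel]) -> List[Tuple[float, float]]:
    """Continuous view: the DES coordinates of the same images."""
    return [lab.des for lab in labels]


# Made-up records, not taken from the dataset.
labels = [
    DualLabel("img_a", "sadness", (-0.6, -0.2)),
    DualLabel("img_b", "amusement", (0.7, 0.5)),
]
print(classification_targets(labels))
print(regression_targets(labels))
```

Keeping both labels on one record means a classifier and a regressor can be trained or evaluated on exactly the same images, without joining two separately annotated datasets.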
Where Pith is reading between the lines
- The triplet grounding could be applied to video sequences to track how emotional elements change over time.
- Content recommendation systems might use the region-specific labels to filter or prioritize images based on targeted emotional triggers.
- Cross-cultural tests on the dataset could reveal whether the language-model-generated annotations carry biases in emotional interpretation.
Load-bearing premise
The multi-stage pipeline assumes that outputs from multi-modal large language models contain few systematic biases or inconsistencies and therefore require only light human verification to reach high reliability.
What would settle it
A controlled check on a random sample of images, in which independent human annotators compare the provided B-A-S triplet labels against the actual image content. Low agreement rates would show that the annotations do not reliably capture the intended elements.
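One way such a check could be scored is sketched below. It is not a protocol from the paper: the function names are hypothetical, each judgment is assumed to be a simple yes/no on whether a triplet matches the image, and the Wilson score interval is one common choice for putting error bars on a sampled agreement rate.

```python
import math
import random
from typing import Iterable, List, Sequence, Tuple


def sample_ids(all_ids: Iterable[int], k: int, seed: int = 0) -> List[int]:
    """Draw k image ids uniformly at random without replacement (seeded)."""
    rng = random.Random(seed)
    return rng.sample(list(all_ids), k)


def agreement_with_wilson(
    judgments: Sequence[bool], z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (agreement rate, lower bound, upper bound) using a Wilson score interval.

    Each judgment is True when a human annotator accepts the provided triplet.
    z = 1.96 corresponds to an approximate 95% interval.
    """
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")
    p = sum(judgments) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half


# Made-up judgments: 80 accepted triplets out of 100 sampled.
judgments = [True] * 80 + [False] * 20
rate, low, high = agreement_with_wilson(judgments)
print(f"agreement={rate:.2f}, interval=({low:.3f}, {high:.3f})")
```

Reporting the interval alongside the rate matters here because the sample is tiny relative to 219k images; a wide interval would signal that the sample is too small to settle the question either way.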
Original abstract
Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EmoVerse, a large-scale open-source dataset of over 219k images for interpretable visual emotion analysis. Emotions are decomposed into Background-Attribute-Subject (B-A-S) triplets with visual region grounding to enable word-level and subject-level reasoning; dual annotations are provided in both Categorical Emotion States (CES) and Dimensional Emotion Space (DES). A novel multi-stage MLLM pipeline is used to generate the annotations with claimed high reliability and minimal human effort, and an interpretable model is presented that maps visual cues to DES representations with attribution explanations.
Significance. If the reliability of the B-A-S annotations and dual CES/DES labels is demonstrated, the dataset would provide a valuable resource for advancing explainable visual emotion analysis beyond single discrete labels, supporting more detailed affective reasoning and unified discrete-continuous representations in computer vision.
major comments (1)
- [Abstract and Methods/Pipeline section] The central claim that the multi-stage MLLM pipeline produces trustworthy annotations 'with high reliability and minimal human effort' (Abstract and pipeline description) lacks any supporting quantitative evidence. No inter-annotator agreement metrics (e.g., Fleiss' kappa), human-expert agreement on a sampled subset, error analysis, or bias audits for affective decomposition or visual grounding are reported. This directly undermines the trustworthiness of the 219k B-A-S triplets and dual labels, which is load-bearing for the dataset's utility.
minor comments (2)
- [Abstract] Clarify the exact number of images, triplets, and annotation coverage statistics in the abstract or early introduction for immediate context.
- [Introduction and Dataset sections] Ensure consistent terminology when referring to 'subject-level' versus 'word-level' emotional reasoning across sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the need for quantitative evidence supporting the reliability of the annotations below.
-
Referee: [Abstract and Methods/Pipeline section] The central claim that the multi-stage MLLM pipeline produces trustworthy annotations 'with high reliability and minimal human effort' (Abstract and pipeline description) lacks any supporting quantitative evidence. No inter-annotator agreement metrics (e.g., Fleiss' kappa), human-expert agreement on a sampled subset, error analysis, or bias audits for affective decomposition or visual grounding are reported. This directly undermines the trustworthiness of the 219k B-A-S triplets and dual labels, which is load-bearing for the dataset's utility.
Authors: We agree that the manuscript would be strengthened by including explicit quantitative validation of the annotation quality. In the revised version, we will add a new subsection under Methods detailing a human evaluation protocol. This will report inter-annotator agreement (Fleiss' kappa for CES labels and Pearson/Spearman correlations for DES dimensions) between the MLLM outputs and multiple human experts on a stratified sample of at least 1,000 images. We will also include a systematic error analysis of B-A-S triplet decomposition and region grounding accuracy, along with a bias audit examining potential demographic or cultural skews in the generated labels. These additions will provide direct empirical support for the pipeline's reliability claims while preserving the minimal-human-effort design.
Revision: yes
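For readers unfamiliar with the agreement statistic named in the response, the following is a minimal sketch of Fleiss' kappa in plain Python. The input layout (one row per image, one column per CES category, each cell counting the raters who chose that category) and the toy numbers are assumptions for illustration; how the MLLM output would be counted among the raters is not specified in the response.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    n_categories = len(counts[0])
    # Observed agreement: mean over items of the pairwise agreement P_i.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions p_j.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 10 items, 14 raters, 5 categories (made-up counts).
table = [
    [0, 0, 0, 0, 14],
    [0, 2, 6, 4, 2],
    [0, 0, 3, 5, 6],
    [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1],
    [7, 7, 0, 0, 0],
    [3, 2, 6, 3, 0],
    [2, 5, 3, 2, 2],
    [6, 5, 2, 1, 0],
    [0, 2, 2, 3, 7],
]
print(round(fleiss_kappa(table), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for emotion labels because a few dominant categories can inflate raw agreement. The DES correlations mentioned in the response would need a separate statistic and are not covered by this sketch.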
Circularity Check
Dataset release exhibits no circularity in claimed derivations
The paper introduces EmoVerse as a dataset constructed via a multi-stage MLLM pipeline for B-A-S triplet annotations and dual CES/DES labels across 219k images. No mathematical derivations, predictions, fitted parameters, or uniqueness theorems are presented that reduce to the inputs by construction. The central value lies in the external usability of the released dataset and its annotations rather than any internal self-referential logic, self-citation chains, or ansatzes. Claims of high annotation reliability with minimal human effort are methodological assertions without evidence of reduction to prior author work or statistical forcing.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  "A novel multi-stage pipeline ensures high annotation reliability with minimal human effort... B-A-S triplets and grounding each element to visual regions"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  "EmoVerse provides word-level and subject-level emotional reasoning by decomposing emotions into Background-Attribute-Subject (B-A-S) triplets"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.