Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition
Pith reviewed 2026-05-20 15:51 UTC · model grok-4.3
The pith
HyperEmo-RAG embeds emotion taxonomies and multimodal inputs in hyperbolic space to retrieve hierarchical evidence for fine-grained recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Jointly embedding hierarchical emotion labels and multimodal samples into a Poincaré ball enables a hierarchical beam-search process that retrieves evidence from coarse to fine-grained levels; the resulting evidence graph is then injected into the LLM via Tree-Aware Attention and an EmotionGraphFormer, preserving graph structure and yielding higher performance than flat-label baselines.
What carries the argument
Hierarchical hyperbolic grounding: the joint embedding of the emotion taxonomy tree and input samples into the Poincaré ball, together with the beam-search deliberation that moves from broad to specific categories.
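For reference, the textbook geodesic distance on the unit Poincaré ball (curvature -1) is shown below. The summary does not say which curvature or distance variant HyperEmo-RAG uses, so this is the standard form rather than the authors' exact formulation.

```latex
d_{\mathbb{B}}(u, v) = \operatorname{arcosh}\!\left( 1 + \frac{2\,\lVert u - v \rVert^{2}}{\bigl(1 - \lVert u \rVert^{2}\bigr)\bigl(1 - \lVert v \rVert^{2}\bigr)} \right), \qquad \lVert u \rVert < 1,\ \lVert v \rVert < 1
```

Distances grow rapidly as points approach the boundary of the ball, which is why tree-like data is often embedded with coarse nodes near the origin and fine-grained nodes near the edge.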
If this is right
- Fine-grained emotion categories become distinguishable by following the established psychological hierarchy rather than guessing among flat labels.
- Structured external evidence reduces the impact of noisy or ambiguous multimodal cues during inference.
- The evidence graph and graph-aware injection layers allow the language model to use relational knowledge without flattening it into text.
- Performance gains appear consistently across multiple standard multimodal emotion datasets.
Where Pith is reading between the lines
- The same coarse-to-fine hyperbolic retrieval could be tested on other label hierarchies such as medical symptom trees or product category taxonomies.
- Dynamic updates to the emotion knowledge base would allow the system to track evolving psychological classifications without retraining the entire model.
- Pairing the framework with larger multimodal foundation models might produce more coherent affective reasoning in open-ended dialogue settings.
Load-bearing premise
Emotion taxonomies possess a stable hierarchical tree structure that can be embedded jointly with multimodal data in hyperbolic space so that progressive retrieval improves classification.
What would settle it
Replacing the Poincaré-ball embedding and hierarchical beam search with standard Euclidean retrieval on the same datasets and finding no measurable drop in fine-grained accuracy would falsify the central mechanism.
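The proposed check amounts to swapping the retrieval metric while holding everything else fixed. The sketch below shows that swap on a toy label bank; the coordinates, the names `top_k` and `bank`, and the use of curvature -1 are illustrative assumptions, not details from the paper.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance on the unit Poincaré ball (curvature -1, assumed)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))

def euclidean_distance(u, v):
    """Plain Euclidean distance, used as the flat baseline metric."""
    return math.dist(u, v)

def top_k(query, bank, k, distance):
    """Return the k entries of `bank` closest to `query` under `distance`."""
    ranked = sorted(bank, key=lambda name: distance(query, bank[name]))
    return ranked[:k]

# Hypothetical 2-D embeddings: one coarse label near the origin,
# two fine-grained labels closer to the boundary of the ball.
bank = {
    "positive": (0.10, 0.00),
    "joy": (0.80, 0.10),
    "pride": (0.70, -0.40),
}
query = (0.50, 0.00)

nearest_euclidean = top_k(query, bank, 1, euclidean_distance)
nearest_hyperbolic = top_k(query, bank, 1, poincare_distance)
```

With these toy coordinates the two metrics disagree on the nearest neighbor, which is the kind of difference an ablation on real data would need to translate into a change in fine-grained accuracy.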
Original abstract
Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.
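The coarse-to-fine retrieval described in the abstract can be pictured as a beam search that descends the label tree one level at a time. The sketch below is a minimal illustration under stated assumptions: the tree, the 2-D coordinates, and the function name `hierarchical_beam_search` are hypothetical, and it ranks label prototypes only, whereas the paper retrieves samples.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance on the unit Poincaré ball (curvature -1, assumed)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))

def hierarchical_beam_search(query, tree, embeddings, root, beam_width):
    """Descend `tree` level by level, keeping the `beam_width` children
    closest to `query`; return the leaf labels reached, nearest first."""
    def dist(label):
        return poincare_distance(query, embeddings[label])

    beam, leaves = [root], []
    while beam:
        candidates = []
        for node in beam:
            children = tree.get(node, [])
            if children:
                candidates.extend(children)  # expand to the next, finer level
            else:
                leaves.append(node)          # fine-grained label reached
        beam = sorted(candidates, key=dist)[:beam_width]
    return sorted(leaves, key=dist)

# Hypothetical two-level taxonomy and toy embeddings.
tree = {
    "root": ["positive", "negative"],
    "positive": ["joy", "pride"],
    "negative": ["anger", "fear"],
}
embeddings = {
    "root": (0.0, 0.0),
    "positive": (0.4, 0.0), "negative": (-0.4, 0.0),
    "joy": (0.7, 0.3), "pride": (0.7, -0.3),
    "anger": (-0.7, 0.3), "fear": (-0.7, -0.3),
}
query = (0.6, 0.2)

narrow = hierarchical_beam_search(query, tree, embeddings, "root", beam_width=1)
wide = hierarchical_beam_search(query, tree, embeddings, "root", beam_width=2)
```

A wider beam keeps more coarse branches alive at each level, trading extra distance computations for a lower chance of pruning the correct fine-grained label early.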
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HyperEmo-RAG, a retrieval-augmented generation framework for multimodal emotion recognition. It introduces hierarchical hyperbolic grounding by jointly embedding emotion taxonomies and multimodal samples into a Poincaré ball, followed by a hierarchical beam-search process for progressive retrieval from coarse to fine-grained levels. Structured evidence is injected via an evidence graph, Tree-Aware Attention, and EmotionGraphFormer into an LLM. The paper claims that experiments on multiple datasets show significant outperformance over existing methods.
Significance. If the empirical results hold and the hyperbolic embedding does preserve the hierarchy, this work could improve fine-grained multimodal emotion recognition by combining an external hierarchical knowledge base with hyperbolic geometry to model emotion taxonomies. It could also reduce over-interpretation of noisy cues in MLLMs.
major comments (2)
- [Hierarchical Hyperbolic Grounding] The description of jointly embedding hierarchical emotion labels and multimodal samples into the Poincaré ball does not specify any auxiliary loss or regularizer to enforce hierarchy preservation (e.g., smaller hyperbolic distances for parent-child pairs than for siblings). Standard multimodal contrastive losses may not suffice to ensure the geometry encodes the tree structure, which is load-bearing for the claimed effectiveness of the hierarchical beam-search deliberation process.
- [Experiments] The abstract states that HyperEmo-RAG significantly outperforms existing methods on multiple datasets, but the provided text lacks details on the specific datasets, baselines, evaluation metrics, ablation studies, or statistical significance tests. This makes it difficult to assess whether the data support the central claim of improvement due to the proposed innovations.
minor comments (1)
- [Abstract] The abstract mentions 'EmotionGraphFormer' without prior definition or reference; a brief explanation or citation would improve clarity.
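The hierarchy-preservation concern in the first major comment can be made concrete with a simple hinge penalty that asks each child label to sit closer to its parent than to any sibling. This is one plausible form of such a regularizer, not the loss used in the paper; the function name, the margin value, and the toy coordinates are assumptions.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance on the unit Poincaré ball (curvature -1, assumed)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))

def hierarchy_margin_penalty(embeddings, parent_of, margin=0.1):
    """Mean hinge penalty over (child, sibling) pairs: a pair is penalized
    when d(child, parent) + margin exceeds d(child, sibling)."""
    total, count = 0.0, 0
    for child, parent in parent_of.items():
        d_parent = poincare_distance(embeddings[child], embeddings[parent])
        for other, other_parent in parent_of.items():
            if other == child or other_parent != parent:
                continue  # only compare against true siblings
            d_sibling = poincare_distance(embeddings[child], embeddings[other])
            total += max(0.0, d_parent - d_sibling + margin)
            count += 1
    return total / count if count else 0.0

parent_of = {"joy": "positive", "pride": "positive"}

# Hypothetical layout that respects the tree: siblings are well separated.
well_separated = {"positive": (0.3, 0.0), "joy": (0.6, 0.4), "pride": (0.6, -0.4)}
# Hypothetical layout that violates it: siblings collapse onto each other.
collapsed = {"positive": (0.0, 0.0), "joy": (0.6, 0.05), "pride": (0.6, -0.05)}

penalty_ok = hierarchy_margin_penalty(well_separated, parent_of)
penalty_bad = hierarchy_margin_penalty(collapsed, parent_of)
```

The penalty is zero for the first layout and positive for the second, so a term of this shape would give the training objective an explicit signal when the geometry stops reflecting the taxonomy.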
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve clarity and completeness.
- Referee: [Hierarchical Hyperbolic Grounding] The description of jointly embedding hierarchical emotion labels and multimodal samples into the Poincaré ball does not specify any auxiliary loss or regularizer to enforce hierarchy preservation (e.g., smaller hyperbolic distances for parent-child pairs than for siblings). Standard multimodal contrastive losses may not suffice to ensure the geometry encodes the tree structure, which is load-bearing for the claimed effectiveness of the hierarchical beam-search deliberation process.
  Authors: We agree that the current description of the hierarchical hyperbolic grounding lacks explicit detail on the mechanisms used to preserve the emotion taxonomy structure. The joint embedding combines a multimodal contrastive objective with hyperbolic distance terms that penalize violations of parent-child proximity relative to sibling or unrelated pairs; however, this was not sufficiently elaborated. In the revised manuscript, we will add a dedicated subsection detailing the full embedding loss, including the specific regularizer that enforces smaller Poincaré distances for hierarchical relations, along with its weighting and integration with the beam-search process.
  Revision: yes
- Referee: [Experiments] The abstract states that HyperEmo-RAG significantly outperforms existing methods on multiple datasets, but the provided text lacks details on the specific datasets, baselines, evaluation metrics, ablation studies, or statistical significance tests. This makes it difficult to assess whether the data support the central claim of improvement due to the proposed innovations.
  Authors: We acknowledge that the abstract and experimental presentation would benefit from greater specificity to allow readers to fully evaluate the claims. The full manuscript contains results across standard multimodal emotion datasets with comparisons to recent baselines and ablations, but these elements were not summarized at a sufficient level of detail. In the revision, we will update the abstract to name the primary datasets and metrics, and we will expand the experiments section to include comprehensive tables for baselines, ablation variants, evaluation protocols, and statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values).
  Revision: yes
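As an illustration of the paired significance testing promised in the second response, the sketch below implements an exact sign-flip permutation test on paired scores. This is a different non-parametric test from the paired t-test and Wilcoxon test the authors name, chosen here because it needs only the standard library; the per-fold scores are invented toy numbers, not results from the paper.

```python
from itertools import product

def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign assignment is enumerated and the p-value is
    the fraction whose |sum| is at least as large as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme, total = 0, 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical per-fold accuracies for a proposed model and a baseline.
proposed = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline = [0.68, 0.67, 0.70, 0.69, 0.70]

p_value = paired_sign_flip_test(proposed, baseline)
```

Enumeration is exponential in the number of pairs, so this exact form only suits a handful of folds or seeds; with more pairs one would sample sign assignments or switch to a library implementation of the tests the authors mention.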
Circularity Check
No circularity: framework uses standard hyperbolic embedding and RAG without self-referential reduction
The paper's central claims rest on a proposed architecture (hierarchical hyperbolic grounding into Poincaré ball plus Tree-Aware Attention) whose description in the abstract does not reduce any prediction or result to a fitted parameter or self-citation by construction. No equations are supplied that would make the beam-search output equivalent to its inputs, nor is a uniqueness theorem imported from the authors' prior work. Standard multimodal contrastive losses are not claimed to automatically enforce tree structure; the hierarchy is presented as an explicit design choice. Experiments are cited as external validation. This is a normal, self-contained proposal of a new retrieval-augmented pipeline.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Emotion taxonomies possess an inherent hierarchical tree structure.
invented entities (1)
- EmotionGraphFormer: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost, costAlphaLog_high_calibrated_iff (echoes)
  Passage: "jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels"
- IndisputableMonolith/Foundation/BranchSelection, branch_selection (echoes)
  Passage: "tree-distance-aware multimodal contrastive loss to preserve semantic relations among emotion categories, and a path consistency loss to regularize the alignment between multimodal queries and selected fine-grained emotion prototypes"
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.