arxiv: 2605.18884 · v1 · pith:YDLAQ7A2new · submitted 2026-05-16 · 💻 cs.LG · cs.CV

Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition

Pith reviewed 2026-05-20 15:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords multimodal emotion recognition · hyperbolic embeddings · retrieval-augmented generation · Poincaré ball · hierarchical retrieval · emotion taxonomy · structured knowledge injection · Tree-Aware Attention

The pith

HyperEmo-RAG embeds emotion taxonomies and multimodal inputs in hyperbolic space to retrieve hierarchical evidence for fine-grained recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a retrieval-augmented framework that respects the natural tree structure of human emotions instead of treating categories as unrelated labels. It places both the emotion hierarchy and samples from text, audio, and video into a Poincaré ball so a beam-search process can pull relevant examples first at broad levels and then at specific ones. The retrieved items become a graph that is injected into a large language model through specialized attention layers, supplying explicit relational context. This combination is intended to limit errors from noisy signals and improve accuracy on detailed emotion distinctions. If the method works as described, multimodal systems would gain a reliable way to draw on external psychological structure during inference.

Core claim

Jointly embedding hierarchical emotion labels and multimodal samples into a Poincaré ball enables a hierarchical beam-search process that retrieves evidence from coarse to fine-grained levels; the resulting evidence graph is then injected into the LLM via Tree-Aware Attention and an EmotionGraphFormer, preserving graph structure and yielding higher performance than flat-label baselines.

What carries the argument

Hierarchical hyperbolic grounding: the joint embedding of the emotion taxonomy tree and input samples into the Poincaré ball, together with the beam-search deliberation that moves from broad to specific categories.
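
The coarse-to-fine retrieval described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy taxonomy, the 2-D prototype coordinates, the function names, and the `beam_width` parameter are all assumptions, and the sketch ranks label prototypes only, whereas the paper also retrieves stored samples at each level.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical two-level taxonomy (parent -> children); not the paper's tree.
TOY_TREE = {
    "root": ["positive", "negative"],
    "positive": ["joy", "pride"],
    "negative": ["anger", "fear"],
}

# Hypothetical 2-D prototypes: coarse labels sit near the origin,
# fine-grained labels sit closer to the boundary of the ball.
TOY_EMB = {
    "positive": (0.4, 0.0),
    "negative": (-0.4, 0.0),
    "joy": (0.7, 0.2),
    "pride": (0.7, -0.2),
    "anger": (-0.7, 0.2),
    "fear": (-0.7, -0.2),
}


def hierarchical_beam_search(query, tree, emb, beam_width=2, root="root"):
    """Descend the tree level by level, keeping the beam_width closest nodes."""
    beam = [root]
    while any(tree.get(node) for node in beam):
        candidates = []
        for node in beam:
            # Expand internal nodes; carry leaves forward unchanged.
            candidates.extend(tree.get(node) or [node])
        candidates.sort(key=lambda name: poincare_distance(query, emb[name]))
        beam = candidates[:beam_width]
    return beam


query = (0.65, 0.15)  # hypothetical embedded multimodal sample
top1 = hierarchical_beam_search(query, TOY_TREE, TOY_EMB, beam_width=1)
top2 = hierarchical_beam_search(query, TOY_TREE, TOY_EMB, beam_width=2)
```

A wider beam keeps more coarse branches alive, which trades extra distance computations for some protection against an early wrong turn at the coarse level.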

If this is right

  • Fine-grained emotion categories become distinguishable by following the established psychological hierarchy rather than guessing among flat labels.
  • Structured external evidence reduces the impact of noisy or ambiguous multimodal cues during inference.
  • The evidence graph and graph-aware injection layers allow the language model to use relational knowledge without flattening it into text.
  • Performance gains appear consistently across multiple standard multimodal emotion datasets.
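
The summary does not say how Tree-Aware Attention is computed. One common way to make attention structure-aware is to subtract a penalty proportional to pairwise tree distance from the attention logits, and the sketch below shows only that generic idea. The function name, the `penalty` parameter, and the use of hop counts as distances are assumptions made for illustration.

```python
import math


def tree_biased_attention(query, keys, tree_dists, penalty=1.0):
    """Softmax attention whose logits are reduced by a tree-distance penalty.

    tree_dists[i] is an assumed hop count between the query node and the
    i-th evidence node in the evidence graph.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale - penalty * dist
        for key, dist in zip(keys, tree_dists)
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three evidence nodes with identical content but growing tree distance.
weights = tree_biased_attention(
    query=(1.0, 0.0),
    keys=[(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)],
    tree_dists=[0, 1, 2],
)
```

With identical keys, the weights differ only because of the structural term, so nodes that are closer in the tree receive more attention.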

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coarse-to-fine hyperbolic retrieval could be tested on other label hierarchies such as medical symptom trees or product category taxonomies.
  • Dynamic updates to the emotion knowledge base would allow the system to track evolving psychological classifications without retraining the entire model.
  • Pairing the framework with larger multimodal foundation models might produce more coherent affective reasoning in open-ended dialogue settings.

Load-bearing premise

Emotion taxonomies possess a stable hierarchical tree structure that can be embedded jointly with multimodal data in hyperbolic space so that progressive retrieval improves classification.
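
For reference, the standard geodesic distance on the unit Poincaré ball (curvature -1) is given below. The summary does not state which curvature or parameterisation the paper uses, so this is the textbook form rather than a confirmed formula from the authors.

```latex
d_{\mathbb{B}}(u, v) = \operatorname{arcosh}\!\left(
  1 + \frac{2\,\lVert u - v \rVert^{2}}
           {\bigl(1 - \lVert u \rVert^{2}\bigr)\bigl(1 - \lVert v \rVert^{2}\bigr)}
\right),
\qquad \lVert u \rVert < 1,\ \lVert v \rVert < 1 .
```

The denominator shrinks as either point approaches the boundary, so distances grow rapidly there. That growth is what allows the exponentially increasing number of nodes at deeper tree levels to be placed with low distortion.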

What would settle it

Replacing the Poincaré-ball embedding and hierarchical beam search with standard Euclidean retrieval on the same datasets and finding no measurable drop in fine-grained accuracy would falsify the central mechanism.
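
The core of such an ablation is a retrieval routine in which only the distance function changes. The sketch below is an assumption-laden toy: the retrieval bank, the coordinates, and the function names are hypothetical, and a faithful ablation would also retrain the embeddings in Euclidean space rather than reuse hyperbolic ones.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance, used for the flat-geometry baseline."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def retrieve_top_k(query, bank, distance, k=1):
    """Return the k bank keys closest to the query under the given metric."""
    return sorted(bank, key=lambda name: distance(query, bank[name]))[:k]


# Hypothetical query and retrieval bank, chosen so the two metrics disagree.
query = (0.9, 0.0)
bank = {"a": (0.5, 0.0), "b": (0.9, 0.3)}

euclid_top = retrieve_top_k(query, bank, euclidean_distance)
hyper_top = retrieve_top_k(query, bank, poincare_distance)
```

Here the Euclidean metric prefers `b`, while the Poincaré metric prefers `a`, because `b` lies close to the boundary where hyperbolic distances are stretched. If fine-grained accuracy were unchanged after such a swap on the real datasets, the geometry would not be doing the work the paper attributes to it.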

Figures

Figures reproduced from arXiv: 2605.18884 by Bo Zhao, Hui Ma, Qianyu Xie, Ruixin Zhang, Shouhong Ding, Yijie Zhu, Zeheng Wang, Zhishu Liu, Zitong Yu.

Figure 1: Comparison of emotion recognition frame…
Figure 2: The overall framework of HyperEmo-RAG. Multimodal features (visual, acoustic, and language) are…
Figure 3: Parameter sensitivity analysis of HyperEmo-RAG with respect to the retrieval top-k, the number of…
Original abstract

Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes HyperEmo-RAG, a retrieval-augmented generation framework for multimodal emotion recognition. It introduces hierarchical hyperbolic grounding by jointly embedding emotion taxonomies and multimodal samples into a Poincaré ball, followed by a hierarchical beam-search process for progressive retrieval from coarse to fine-grained levels. Structured evidence is injected via an evidence graph, Tree-Aware Attention, and EmotionGraphFormer into an LLM. The paper claims that experiments on multiple datasets show significant outperformance over existing methods.

Significance. If the empirical results hold and the hierarchical structure is effectively preserved in the hyperbolic space, this work could contribute to improving fine-grained multimodal emotion recognition by leveraging external hierarchical knowledge bases and hyperbolic geometry to model emotion taxonomies, potentially reducing over-interpretation of noisy cues in MLLMs.

major comments (2)
  1. [Hierarchical Hyperbolic Grounding] The description of jointly embedding hierarchical emotion labels and multimodal samples into the Poincaré ball does not specify any auxiliary loss or regularizer to enforce hierarchy preservation (e.g., smaller hyperbolic distances for parent-child pairs than for siblings). Standard multimodal contrastive losses may not suffice to ensure the geometry encodes the tree structure, which is load-bearing for the claimed effectiveness of the hierarchical beam-search deliberation process.
  2. [Experiments] The abstract states that HyperEmo-RAG significantly outperforms existing methods on multiple datasets, but the provided text lacks details on the specific datasets, baselines, evaluation metrics, ablation studies, or statistical significance tests. This makes it difficult to assess whether the data support the central claim of improvement due to the proposed innovations.
minor comments (1)
  1. [Abstract] The abstract mentions 'EmotionGraphFormer' without prior definition or reference; a brief explanation or citation would improve clarity.
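
Major comment 1 asks for a regularizer that keeps parent-child pairs closer than sibling pairs. A minimal sketch of one such term, a hinge on Poincaré distances, is given below. It is an illustration of the referee's example, not the loss the authors use; the function name, the `margin` value, and the coordinates are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def hierarchy_margin_loss(child, parent, sibling, margin=0.1):
    """Hinge penalty that is zero once d(child, parent) + margin <= d(child, sibling)."""
    to_parent = poincare_distance(child, parent)
    to_sibling = poincare_distance(child, sibling)
    return max(0.0, margin + to_parent - to_sibling)


# Hypothetical coordinates: the parent sits nearer the origin than its children.
parent = (0.4, 0.0)
child = (0.7, 0.2)

# Sibling well separated from the child: the constraint holds, loss is zero.
satisfied = hierarchy_margin_loss(child, parent, sibling=(0.7, -0.2))

# Sibling almost on top of the child: the constraint is violated, loss is positive.
violated = hierarchy_margin_loss(child, parent, sibling=(0.7, 0.15))
```

In training, a term like this would be averaged over sampled (child, parent, sibling) triples and added to the contrastive objective with a weighting coefficient, which is the kind of detail the comment asks the authors to report.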

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Hierarchical Hyperbolic Grounding] The description of jointly embedding hierarchical emotion labels and multimodal samples into the Poincaré ball does not specify any auxiliary loss or regularizer to enforce hierarchy preservation (e.g., smaller hyperbolic distances for parent-child pairs than for siblings). Standard multimodal contrastive losses may not suffice to ensure the geometry encodes the tree structure, which is load-bearing for the claimed effectiveness of the hierarchical beam-search deliberation process.

    Authors: We agree that the current description of the hierarchical hyperbolic grounding lacks explicit detail on the mechanisms used to preserve the emotion taxonomy structure. The joint embedding combines a multimodal contrastive objective with hyperbolic distance terms that penalize violations of parent-child proximity relative to sibling or unrelated pairs; however, this was not sufficiently elaborated. In the revised manuscript, we will add a dedicated subsection detailing the full embedding loss, including the specific regularizer that enforces smaller Poincaré distances for hierarchical relations, along with its weighting and integration with the beam-search process.

    revision: yes

  2. Referee: [Experiments] The abstract states that HyperEmo-RAG significantly outperforms existing methods on multiple datasets, but the provided text lacks details on the specific datasets, baselines, evaluation metrics, ablation studies, or statistical significance tests. This makes it difficult to assess whether the data support the central claim of improvement due to the proposed innovations.

    Authors: We acknowledge that the abstract and experimental presentation would benefit from greater specificity to allow readers to fully evaluate the claims. The full manuscript contains results across standard multimodal emotion datasets with comparisons to recent baselines and ablations, but these elements were not summarized at a sufficient level of detail. In the revision, we will update the abstract to name the primary datasets and metrics, and we will expand the experiments section to include comprehensive tables for baselines, ablation variants, evaluation protocols, and statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values).

    revision: yes
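
A paired significance check of the kind promised above can be sketched with the standard library alone. The example uses an exact paired sign-flip permutation test, a nonparametric relative of the paired t-test and Wilcoxon test named in the response, not either of those tests themselves. The function name is hypothetical and the scores are made-up placeholders, not results from the paper.

```python
import itertools


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided p-value from a paired sign-flip permutation test.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign pattern is enumerated and compared with the
    observed total difference.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Made-up placeholder scores per random seed (NOT results from the paper).
proposed = [0.62, 0.64, 0.61, 0.65, 0.63]
baseline = [0.60, 0.61, 0.60, 0.62, 0.59]
p_value = paired_sign_flip_p_value(proposed, baseline)
```

With only five pairs, the smallest two-sided p-value this test can return is 2/32 = 0.0625, so a claim of significance at the 0.05 level needs more seeds or folds than that.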

Circularity Check

0 steps flagged

No circularity: framework uses standard hyperbolic embedding and RAG without self-referential reduction

full rationale

The paper's central claims rest on a proposed architecture (hierarchical hyperbolic grounding into Poincaré ball plus Tree-Aware Attention) whose description in the abstract does not reduce any prediction or result to a fitted parameter or self-citation by construction. No equations are supplied that would make the beam-search output equivalent to its inputs, nor is a uniqueness theorem imported from the authors' prior work. Standard multimodal contrastive losses are not claimed to automatically enforce tree structure; the hierarchy is presented as an explicit design choice. Experiments are cited as external validation. This is a normal, self-contained proposal of a new retrieval-augmented pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that emotion taxonomies possess a stable hierarchical tree structure that hyperbolic space can faithfully represent, plus the introduction of two new components (Tree-Aware Attention and EmotionGraphFormer) whose independent utility is not evidenced in the abstract.

axioms (1)
  • domain assumption: Emotion taxonomies possess an inherent hierarchical tree structure.
    Directly invoked to justify the hierarchical hyperbolic grounding and beam-search process.
invented entities (1)
  • EmotionGraphFormer (no independent evidence)
    purpose: Preserve graph-structured information when injecting retrieved evidence into the LLM.
    New module introduced as part of structured evidence injection; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5769 in / 1243 out tokens · 63009 ms · 2026-05-20T15:51:20.005824+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost costAlphaLog_high_calibrated_iff echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels

  • IndisputableMonolith/Foundation/BranchSelection branch_selection echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    tree-distance-aware multimodal contrastive loss to preserve semantic relations among emotion categories, and a path consistency loss to regularize the alignment between multimodal queries and selected fine-grained emotion prototypes

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
