Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition

Bo Zhao; Hui Ma; Qianyu Xie; Ruixin Zhang; Shouhong Ding; Yijie Zhu; Zeheng Wang; Zhishu Liu; Zitong Yu

arxiv: 2605.18884 · v1 · pith:YDLAQ7A2new · submitted 2026-05-16 · 💻 cs.LG · cs.CV

Pith reviewed 2026-05-20 15:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords multimodal emotion recognition · hyperbolic embeddings · retrieval-augmented generation · Poincaré ball · hierarchical retrieval · emotion taxonomy · structured knowledge injection · Tree-Aware Attention

The pith

HyperEmo-RAG embeds emotion taxonomies and multimodal inputs in hyperbolic space to retrieve hierarchical evidence for fine-grained recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a retrieval-augmented framework that respects the natural tree structure of human emotions instead of treating categories as unrelated labels. It places both the emotion hierarchy and samples from text, audio, and video into a Poincaré ball so a beam-search process can pull relevant examples first at broad levels and then at specific ones. The retrieved items become a graph that is injected into a large language model through specialized attention layers, supplying explicit relational context. This combination is intended to limit errors from noisy signals and improve accuracy on detailed emotion distinctions. If the method works as described, multimodal systems would gain a reliable way to draw on external psychological structure during inference.

Core claim

Jointly embedding hierarchical emotion labels and multimodal samples into a Poincaré ball enables a hierarchical beam-search process that retrieves evidence from coarse to fine-grained levels; the resulting evidence graph is then injected into the LLM via Tree-Aware Attention and an EmotionGraphFormer, preserving graph structure and yielding higher performance than flat-label baselines.
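The Poincaré ball geometry named in the core claim can be made concrete with its standard distance formula. The sketch below is illustrative only: the function name `poincare_distance` and the sample points are assumptions rather than anything taken from the paper, and the curvature is fixed at -1.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# Hypothetical sample points: one pair near the centre, one pair near the edge.
origin = (0.0, 0.0)
near_centre = (0.1, 0.0)
near_edge_a = (0.9, 0.0)
near_edge_b = (0.0, 0.9)

d_centre = poincare_distance(origin, near_centre)
d_edge = poincare_distance(near_edge_a, near_edge_b)
```

The two edge points are about 1.27 apart in Euclidean terms but more than 5 units apart under this metric. That stretching near the boundary is what leaves room for many fine-grained leaf categories in a low-dimensional ball.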

What carries the argument

Hierarchical hyperbolic grounding: the joint embedding of the emotion taxonomy tree and input samples into the Poincaré ball, together with the beam-search deliberation that moves from broad to specific categories.
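A coarse-to-fine beam search over a label tree can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the taxonomy `CHILDREN`, the hand-placed 2-D prototypes in `PROTOTYPES`, and the function `hierarchical_beam_search` are all hypothetical, and every leaf in the toy tree sits at the same depth.

```python
import math

def poincare_distance(u, v):
    """Standard Poincaré-ball distance (curvature -1)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# Hypothetical two-level taxonomy; not the taxonomy used in the paper.
CHILDREN = {
    "root": ["positive", "negative"],
    "positive": ["joy", "contentment"],
    "negative": ["anger", "sadness"],
}
# Hypothetical prototypes: coarse labels near the centre, fine labels nearer the edge.
PROTOTYPES = {
    "positive": (0.4, 0.0),
    "negative": (-0.4, 0.0),
    "joy": (0.7, 0.2),
    "contentment": (0.7, -0.2),
    "anger": (-0.7, 0.2),
    "sadness": (-0.7, -0.2),
}

def hierarchical_beam_search(query, beam_width=1):
    """Descend the tree level by level, keeping the beam_width closest paths."""
    beam = [("root",)]
    while True:
        # Expand every path in the beam by one level.
        candidates = [
            path + (child,)
            for path in beam
            for child in CHILDREN.get(path[-1], [])
        ]
        if not candidates:  # all beam entries are leaves
            return beam
        candidates.sort(key=lambda p: poincare_distance(query, PROTOTYPES[p[-1]]))
        beam = candidates[:beam_width]

query = (0.6, 0.25)  # hypothetical embedded multimodal sample
best_paths = hierarchical_beam_search(query, beam_width=1)
wide_paths = hierarchical_beam_search(query, beam_width=2)
```

With a beam width of 1 the search commits to one coarse branch before looking at fine labels. A wider beam keeps alternatives alive, which matters when the coarse decision is itself uncertain.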

If this is right

Fine-grained emotion categories become distinguishable by following the established psychological hierarchy rather than guessing among flat labels.
Structured external evidence reduces the impact of noisy or ambiguous multimodal cues during inference.
The evidence graph and graph-aware injection layers allow the language model to use relational knowledge without flattening it into text.
Performance gains appear consistently across multiple standard multimodal emotion datasets.
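The abstract does not define how Tree-Aware Attention uses the hierarchy. One common way to make attention structure-aware is to subtract a penalty proportional to graph distance from the attention logits before the softmax. The sketch below shows that generic idea only; the `PARENT` map, `tree_distance`, `tree_biased_attention`, and the `penalty` parameter are hypothetical names and should not be read as the paper's mechanism.

```python
import math

# Hypothetical parent pointers for a toy emotion tree.
PARENT = {
    "positive": "root",
    "negative": "root",
    "joy": "positive",
    "contentment": "positive",
    "anger": "negative",
}

def ancestors(node):
    """Return the chain node, parent, grandparent, ... up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def tree_distance(a, b):
    """Number of edges on the path between a and b via their lowest common ancestor."""
    depth_from_b = {n: i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if n in depth_from_b:
            return i + depth_from_b[n]
    raise ValueError("nodes are not in the same tree")

def tree_biased_attention(query_node, key_nodes, logits, penalty=1.0):
    """Softmax over logits after subtracting penalty * tree distance (an assumed bias form)."""
    biased = [s - penalty * tree_distance(query_node, k) for s, k in zip(logits, key_nodes)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Equal raw logits: only the tree bias separates the two keys.
weights = tree_biased_attention("joy", ["contentment", "anger"], [0.0, 0.0])
```

With equal raw logits, the sibling `contentment` receives more weight than the distant `anger`, so relational structure shapes attention without being serialized into prompt text.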

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coarse-to-fine hyperbolic retrieval could be tested on other label hierarchies such as medical symptom trees or product category taxonomies.
Dynamic updates to the emotion knowledge base would allow the system to track evolving psychological classifications without retraining the entire model.
Pairing the framework with larger multimodal foundation models might produce more coherent affective reasoning in open-ended dialogue settings.

Load-bearing premise

Emotion taxonomies possess a stable hierarchical tree structure that can be embedded jointly with multimodal data in hyperbolic space so that progressive retrieval improves classification.

What would settle it

Replacing the Poincaré-ball embedding and hierarchical beam search with standard Euclidean retrieval on the same datasets and finding no measurable drop in fine-grained accuracy would falsify the central mechanism.
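The proposed falsification test amounts to swapping the distance function while holding the retrieval code fixed. The sketch below shows such a harness on toy data. The names `retrieve`, `BANK`, and `QUERY` and all coordinates are hypothetical, chosen only to show that the two metrics can rank the same candidates differently.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def poincare_distance(u, v):
    """Standard Poincaré-ball distance (curvature -1)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def retrieve(query, bank, distance, k=1):
    """Return the names of the k bank entries closest to the query."""
    ranked = sorted(bank, key=lambda name: distance(query, bank[name]))
    return ranked[:k]

# Hypothetical retrieval bank and query, all inside the unit ball.
BANK = {"near_boundary": (0.9, 0.0), "near_centre": (0.05, 0.0)}
QUERY = (0.5, 0.0)

euclidean_top = retrieve(QUERY, BANK, euclidean_distance)
hyperbolic_top = retrieve(QUERY, BANK, poincare_distance)
```

Here the Euclidean ranking prefers the point near the boundary while the hyperbolic ranking prefers the point near the centre. The ablation is therefore informative: if fine-grained accuracy does not change when the metric is swapped, the geometry is not doing the work the paper attributes to it.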

Figures

Figures reproduced from arXiv: 2605.18884 by Bo Zhao, Hui Ma, Qianyu Xie, Ruixin Zhang, Shouhong Ding, Yijie Zhu, Zeheng Wang, Zhishu Liu, Zitong Yu.

**Figure 2.** The overall framework of HyperEmo-RAG. Multimodal features (visual, acoustic, and language) are …

**Figure 3.** Parameter sensitivity analysis of HyperEmo-RAG with respect to the retrieval top-k, the number of …

Original abstract

Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HyperEmo-RAG pairs hyperbolic embeddings with RAG and graph injection for emotion hierarchies, but the hierarchy preservation looks underspecified.

The paper introduces HyperEmo-RAG to handle the tree structure of emotion categories in multimodal recognition. It embeds both taxonomies and inputs in a Poincaré ball, retrieves evidence with a hierarchical beam search, and feeds the results into an LLM through Tree-Aware Attention and an EmotionGraphFormer. The new element is the concrete combination of coarse-to-fine hyperbolic retrieval with structured graph injection, rather than flat RAG or standard contrastive setups. The paper does a reasonable job of identifying that most models treat emotions as independent labels and lack external knowledge, which can lead to over-interpretation of noisy multimodal cues. The approach also tries to respect psychological taxonomies instead of forcing flat classification.

The soft spot is exactly the one raised in the stress test: the abstract describes joint embedding and beam search but shows no auxiliary loss or regularizer that would make parent-child distances smaller than sibling distances under the hyperbolic metric. Standard multimodal losses do not guarantee this, so the progressive retrieval could reduce to ordinary nearest-neighbor search, and the claimed gains might come from the RAG or graph components instead. The abstract states that the method outperforms existing ones on multiple datasets, yet it supplies no baselines, metrics, or ablations, which makes it impossible to judge the size or source of any improvement. The full paper presumably contains these, and they are essential to assessing the claim.

This work is aimed at researchers in affective computing and multimodal learning who already care about hierarchical structure or external knowledge bases. A reader working on hyperbolic geometry or retrieval methods would get the most out of it. It deserves peer review because the idea is specific enough to check against real experiments and the training objective, even if the hierarchy enforcement needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes HyperEmo-RAG, a retrieval-augmented generation framework for multimodal emotion recognition. It introduces hierarchical hyperbolic grounding by jointly embedding emotion taxonomies and multimodal samples into a Poincaré ball, followed by a hierarchical beam-search process for progressive retrieval from coarse to fine-grained levels. Structured evidence is injected via an evidence graph, Tree-Aware Attention, and EmotionGraphFormer into an LLM. The paper claims that experiments on multiple datasets show significant outperformance over existing methods.

Significance. If the empirical results hold and the hierarchical structure is effectively preserved in the hyperbolic space, this work could contribute to improving fine-grained multimodal emotion recognition by leveraging external hierarchical knowledge bases and hyperbolic geometry to model emotion taxonomies, potentially reducing over-interpretation of noisy cues in MLLMs.

major comments (2)

[Hierarchical Hyperbolic Grounding] The description of jointly embedding hierarchical emotion labels and multimodal samples into the Poincaré ball does not specify any auxiliary loss or regularizer to enforce hierarchy preservation (e.g., smaller hyperbolic distances for parent-child pairs than for siblings). Standard multimodal contrastive losses may not suffice to ensure the geometry encodes the tree structure, which is load-bearing for the claimed effectiveness of the hierarchical beam-search deliberation process.
[Experiments] The abstract states that HyperEmo-RAG significantly outperforms existing methods on multiple datasets, but the provided text lacks details on the specific datasets, baselines, evaluation metrics, ablation studies, or statistical significance tests. This makes it difficult to assess whether the data support the central claim of improvement due to the proposed innovations.
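The kind of regularizer the first major comment asks about can be sketched as a margin-based hinge on hyperbolic distances: a child should sit closer to its parent than to a sibling by at least some margin. This is a generic triplet-style term written under assumptions, not the paper's loss. The function `hierarchy_hinge`, the `margin` value, and all coordinates are hypothetical.

```python
import math

def poincare_distance(u, v):
    """Standard Poincaré-ball distance (curvature -1)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def hierarchy_hinge(child, parent, sibling, margin=0.1):
    """Zero when d(child, parent) + margin <= d(child, sibling), positive otherwise."""
    gap = poincare_distance(child, parent) - poincare_distance(child, sibling)
    return max(0.0, margin + gap)

parent = (0.3, 0.0)

# Violating layout: the two siblings sit closer to each other than to the parent.
violating_loss = hierarchy_hinge(child=(0.6, 0.1), parent=parent, sibling=(0.6, -0.1))

# Satisfied layout: the child hugs the parent and the sibling is far away.
satisfied_loss = hierarchy_hinge(child=(0.4, 0.0), parent=parent, sibling=(0.3, 0.5))
```

In training, a term like this would be summed over sampled (child, parent, sibling) triples and added to the main objective with a weight. Without some such constraint, nothing forces the learned geometry to reflect the tree.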

minor comments (1)

[Abstract] The abstract mentions 'EmotionGraphFormer' without prior definition or reference; a brief explanation or citation would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve clarity and completeness.

Referee: [Hierarchical Hyperbolic Grounding] The description of jointly embedding hierarchical emotion labels and multimodal samples into the Poincaré ball does not specify any auxiliary loss or regularizer to enforce hierarchy preservation (e.g., smaller hyperbolic distances for parent-child pairs than for siblings). Standard multimodal contrastive losses may not suffice to ensure the geometry encodes the tree structure, which is load-bearing for the claimed effectiveness of the hierarchical beam-search deliberation process.

Authors: We agree that the current description of the hierarchical hyperbolic grounding lacks explicit detail on the mechanisms used to preserve the emotion taxonomy structure. The joint embedding combines a multimodal contrastive objective with hyperbolic distance terms that penalize violations of parent-child proximity relative to sibling or unrelated pairs; however, this was not sufficiently elaborated. In the revised manuscript, we will add a dedicated subsection detailing the full embedding loss, including the specific regularizer that enforces smaller Poincaré distances for hierarchical relations, along with its weighting and integration with the beam-search process. revision: yes
Referee: [Experiments] The abstract states that HyperEmo-RAG significantly outperforms existing methods on multiple datasets, but the provided text lacks details on the specific datasets, baselines, evaluation metrics, ablation studies, or statistical significance tests. This makes it difficult to assess whether the data support the central claim of improvement due to the proposed innovations.

Authors: We acknowledge that the abstract and experimental presentation would benefit from greater specificity to allow readers to fully evaluate the claims. The full manuscript contains results across standard multimodal emotion datasets with comparisons to recent baselines and ablations, but these elements were not summarized at a sufficient level of detail. In the revision, we will update the abstract to name the primary datasets and metrics, and we will expand the experiments section to include comprehensive tables for baselines, ablation variants, evaluation protocols, and statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values). revision: yes
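The paired significance testing mentioned in the second response can be illustrated with a paired t statistic computed over per-seed scores. The numbers below are placeholders invented for the example and are not results from the paper; the function name `paired_t_statistic` is likewise hypothetical. The sketch uses only the standard library, so it stops at the statistic and the degrees of freedom. A p-value would need a t-distribution table or a statistics package such as SciPy.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t_value = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1

# Placeholder accuracies over five seeds; NOT taken from the paper.
proposed = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline = [0.68, 0.67, 0.70, 0.69, 0.69]

t_value, dof = paired_t_statistic(proposed, baseline)
```

Pairing by seed or by data split removes variance that both systems share, which is why a paired test is usually preferred over comparing two independent means when the same splits are reused.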

Circularity Check

0 steps flagged

No circularity: framework uses standard hyperbolic embedding and RAG without self-referential reduction

The paper's central claims rest on a proposed architecture (hierarchical hyperbolic grounding into Poincaré ball plus Tree-Aware Attention) whose description in the abstract does not reduce any prediction or result to a fitted parameter or self-citation by construction. No equations are supplied that would make the beam-search output equivalent to its inputs, nor is a uniqueness theorem imported from the authors' prior work. Standard multimodal contrastive losses are not claimed to automatically enforce tree structure; the hierarchy is presented as an explicit design choice. Experiments are cited as external validation. This is a normal, self-contained proposal of a new retrieval-augmented pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that emotion taxonomies possess a stable hierarchical tree structure that hyperbolic space can faithfully represent, plus the introduction of two new components (Tree-Aware Attention and EmotionGraphFormer) whose independent utility is not evidenced in the abstract.

axioms (1)

domain assumption: Emotion taxonomies possess an inherent hierarchical tree structure.
Directly invoked to justify the hierarchical hyperbolic grounding and beam-search process.

invented entities (1)

EmotionGraphFormer (no independent evidence)
purpose: Preserve graph-structured information when injecting retrieved evidence into the LLM.
New module introduced as part of structured evidence injection; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5769 in / 1243 out tokens · 63009 ms · 2026-05-20T15:51:20.005824+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost · costAlphaLog_high_calibrated_iff · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels"

IndisputableMonolith/Foundation/BranchSelection · branch_selection · echoes

Passage: "tree-distance-aware multimodal contrastive loss to preserve semantic relations among emotion categories, and a path consistency loss to regularize the alignment between multimodal queries and selected fine-grained emotion prototypes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 5 internal anchors

[1]

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, and Mahdieh Baghshah. 2025. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. InFindings of the Association for Computational Linguistics: ACL 2025

work page 2025
[2]

Jinze Bai, Shuai Bai, Yunfei Chu, et al. 2023. Qwen Technical Report. arXiv:2309.16609 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Co- hen. 2022. MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 5558–5570

work page 2022
[4]

Hauptmann

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander G. Hauptmann. 2024. Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning. InAdvances in Neural Information Processing Sys- tems

work page 2024
[5]

Zebang Cheng, Yuxiang Lin, Zhiqi Chen, Xiang Li, Shuyi Mao, Fan Zhang, Dongdong Ding, Bowen Zhang, and Xiaojiang Peng. 2023. Semi- supervised multimodal emotion recognition with expression MAE. In Proceedings of the 31st ACM International Conference on Multimedia. 9436–9440

work page 2023
[6]

Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos Kollias, Winfried Fellenz, and John G Taylor. 2001. Emotion recognition in human-computer interaction.IEEE Signal Processing Magazine18, 1 (2001), 32–80

work page 2001
[7]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

work page
[8]

InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies

BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies. 4171–4186

work page 2019
[9]

Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. 2011. Survey on speech emotion recognition: Features, classification schemes, and databases.Pattern Recognition44, 3 (2011), 572–587

work page 2011
[10]

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, et al. 2024. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

J. Guo, J. Tang, W. Dai, Y. Ding, and W. Kong. 2022. Dynamically Adjust Word Representations Using Unaligned Multimodal Information. In Proceedings of the 30th ACM International Conference on Multimedia. 3394–3402

work page 2022
[12]

Wei Han, Hui Chen, and Soujanya Poria. 2021. Improving multimodal fusion with hierarchical mutual information maximization for mul- timodal sentiment analysis. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 9180–9192

work page 2021
[13]

Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-Invariant and-Specific Representations for Multimodal Sentiment Analysis. InProceedings of the 28th ACM International Con- ference on Multimedia. 1122–1131

work page 2020
[14]

D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo. 2022. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. InICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7037–7041

work page 2022
[15]

Guimin Hu, Ting-En Lin, Yi Zhao, Guangming Lu, Yuchuan Wu, and Yongbin Li. 2022. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 7837–7851

work page 2022
[16]

J. Hu, Y. Liu, J. Zhao, and Q. Jin. 2021. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conver- sation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

work page 2021
[17]

Ross, Cordelia Schmid, and Karteek Alahari

Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid, and Karteek Alahari. 2023. REVEAL: Retrieval- Augmented Visual-Language Pre-Training with Multi-Source Multi- modal Knowledge Memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 23369–23379

work page 2023
[18]

Ao Jia, Yu He, Yazhou Zhang, Sagar Uprety, Dawei Song, and Christina Lioma. 2022. Beyond emotion: A multi-modal dataset for human desire understanding. InProceedings of the 2022 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies. 1512–1522

work page 2022
[19]

Xu, Luyu Gao, Zhiqing Sun, Qiao Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qiao Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active Retrieval Augmented Generation. InProceedings of the 2023 Con- ference on Empirical Methods in Natural Language Processing (EMNLP). 7969–7992

work page 2023
[20]

Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, and Sirui Wang. 2023. InstructERC: Reforming emotion recognition in conver- sation with a retrieval multi-task LLMs framework.arXiv preprint arXiv:2309.11911(2023)

work page arXiv 2023
[21]

J. Li, X. Wang, G. Lv, and Z. Zeng. 2023. GA2MIF: Graph and Attention Based Two-Stage Multi-Source Information Fusion for Conversational Emotion Detection. (2023)

work page 2023
[22]

Shan Li and Weihong Deng. 2022. Deep facial expression recognition: A survey.IEEE Transactions on Affective Computing13, 3 (2022), 1195– 1215

work page 2022
[23]

Zaijing Li, Ting-En Lin, Yuchuan Wu, Ming Liu, Fengxiao Tang, Ming Zhao, and Yongbin Li. 2023. UniSA: Unified Generative Framework for Sentiment Analysis. (2023). arXiv:2309.01339

work page arXiv 2023
[24]

Zaijing Li, Fengxiao Tang, Ming Zhao, and Yusen Zhu. 2022. EmoCaps: Emotion Capsule based Model for Conversational Emotion Recognition. InFindings of the Association for Computational Linguistics: ACL 2022. 1610–1618

work page 2022
[25]

Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, and Jianhua Tao. 2025. AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models. InProceedings of the 42nd International Conference on Machine Learning

work page 2025
[26]

Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, and Kai Gao. 2022. Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module. InProceedings of the 2022 International Conference on Multimodal Interaction (ICMI). ACM, 247–258. doi:10.1145/3536221. 3556569

work page doi:10.1145/3536221 2022
[27]

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low- rank multimodal fusion with modality-specific factors. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2247–2256

work page 2018
[28]

Zhishu Liu, Kaishen Yuan, Bo Zhao, Hui Ma, and Zitong Yu. 2026. AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition. arXiv:2603.08387 [cs.CV] https://arxiv. org/abs/2603.08387

work page arXiv 2026
[29]

Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, and Zitong Yu. 2025. Au-llm: Micro-expression action unit detection via enhanced llm-based feature fusion. InChinese Conference on Biometric Recognition. Springer, 355–365

work page 2025
[30]

F. Lv, X. Chen, Y. Huang, L. Duan, and G. Lin. 2021. Progressive Modal- ity Reinforcement for Human Multimodal Emotion Recognition from Unaligned Multimodal Sequences. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. 2554–2562. Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al

work page 2021
[31]

Rosalind W Picard, Elias Vyzas, and Jennifer Healey. 2001. Toward machine emotional intelligence: Analysis of affective physiological state.IEEE Transactions on Pattern Analysis and Machine Intelligence23, 10 (2001), 1175–1191

work page 2001
[32]

Vittorio Pipoli, Alessia Saporita, Federico Bolelli, Marcella Cornia, Lorenzo Baraldi, Costantino Grana, Rita Cucchiara, and Elisa Ficarra

work page
[33]

InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV)

MISSRAG: Addressing the Missing Modality Challenge in Multi- modal Large Language Models. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV)

work page
[34]

Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017. A review of affective computing: From unimodal analysis to multimodal emotion recognition.Information Fusion37 (2017), 98–125

work page 2017
[35]

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. InPro- ceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics. Association for Computational Linguistics, 527–536. doi:10.18653/v1/P19-1050

work page doi:10.18653/v1/p19-1050 2019
[36]

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque

work page
[37]

InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Integrating multimodal information in large pretrained trans- formers. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2359–2369

work page
[38]

Björn W Schuller. 2018. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends.Commun. ACM61, 5 (2018), 90–99

work page 2018
[39]

Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih- Fu Chang, and Maja Pantic. 2017. A survey of multimodal emotion recognition.Image and Vision Computing65 (2017), 1–14

work page 2017
[40]

Jun Sun, Shoukang Han, Yu-Ping Ruan, Xiaoning Zhang, Shu-Kai Zheng, Yulong Liu, Yuxin Huang, and Taihao Li. 2023. Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recogni- tion. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Com- putational Lin...

work page doi:10.18653/v1/2023.acl-long.39 2023
[41]

J. Sun, S. Han, Y.-P. Ruan, X. Zhang, S.-K. Zheng, Y. Liu, Y. Huang, and T. Li. 2023. Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recognition. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 658–670

work page 2023
[42]

Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Zico Kolter, Louis- Philippe Morency, and Ruslan Salakhutdinov

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis- Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal Trans- former for Unaligned Multimodal Language Sequences. InProceedings of the 57th Annual Meeting of the Association for Computational Linguis- tics

work page 2019
[44]

Ma, Muqiao Yang, Ruslan Salakhutdi- nov, and Louis-Philippe Morency

Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdi- nov, and Louis-Philippe Morency. 2020. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing

work page 2020
[45]

Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao

work page
[46]

Region attention networks for pose and occlusion robust facial expression recognition.IEEE Transactions on Image Processing29 (2020), 4057–4069

work page 2020
[47]

Taorui Wang, Xun Lin, Yong Xu, Qilang Ye, Dan Guo, Sergio Escalera, Ghada Khoriba, and Zitong Yu. 2026. Micro-gesture Recognition: A Comprehensive Survey of Datasets, Methods, and Challenges.Machine Intelligence Research23, 2 (2026), 308–330. doi:10.1007/s11633-025-1629- x

work page doi:10.1007/s11633-025-1629- 2026
[48]

Taorui Wang, Zitong Yu, and Yong Xu. 2025. TC-GS: Tri-plane based compression for 3D Gaussian Splatting. arXiv:2503.20221 [cs.CV] https: //arxiv.org/abs/2503.20221

work page arXiv 2025
[49]

Zeheng Wang, Zitong Yu, Yijie Zhu, Bo Zhao, Haochen Liang, Taorui Wang, Wei Xia, Jiayu Zhang, Zhishu Liu, Hui Ma, Fei Ma, and Qi Tian. 2026. AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition.arXiv preprint arXiv:2604.12735(2026). arXiv:2604.12735 https://arxiv.org/abs/2604. 12735

work page internal anchor Pith review Pith/arXiv arXiv 2026
[50]

Zhuofan Wen, Zheng Lian, Shun Chen, Hailiang Yao, Longjiang Yang, Bin Liu, and Jianhua Tao. 2025. Listen, Watch, and Learn to Feel: Retrieval-Augmented Emotion Reasoning for Compound Emotion Gen- eration. InFindings of the Association for Computational Linguistics: ACL 2025

work page 2025
[51]

Peng Xia, Guilin Qi, et al . 2024. Multimodal Retrieval-Augmented Generation: A Survey.arXiv preprint arXiv:2402.03573(2024)

work page arXiv 2024
[52]

Ye, and B

Qu Yang, M. Ye, and B. Du. 2024. EmoLLM: Multimodal Emo- tional Understanding Meets Large Language Models.arXiv preprint arXiv:2406.16442(2024)

work page arXiv 2024
[53]

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen- tau Yih. 2023. Retrieval-Augmented Multimodal Language Modeling. InProceedings of the 40th International Conference on Machine Learning (ICML). 39755–39769

work page 2023
[54]

W. Yu, H. Xu, Z. Yuan, and J. Wu. 2021. Learning Modality-Specific Rep- resentations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10790–10797

work page 2021
[55]

Zheng Yuan, Haotian Liu, Chunyuan Li, et al. 2025. A Survey on Multi- modal Retrieval-Augmented Generation.arXiv preprint arXiv:2504.08748 (2025)

work page arXiv 2025
[56]

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis- Philippe Morency. 2017. Tensor fusion network for multimodal sen- timent analysis. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1103–1114

work page 2017
[57]

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 223... doi:10.18653/v1/p18-1208
[58]

AmirAli Bagher Zadeh, Paul Pu Liang, Jonathan Vanbriesen, Soujanya Poria, Edmund Tong, Erik Cambria, Minghai Chen, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2236–2246.
[59]

Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations. 443–453.
[60]

Sitao Zhang, Yimu Pan, and James Z. Wang. 2023. Learning emotion representations from verbal and nonverbal communication. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18993–19004.
[61]

Zengqun Zhao, Qingshan Liu, and Feng Zhou. 2022. Robust facial expression recognition: A survey. IEEE Transactions on Affective Computing 13, 4 (2022), 1805–1823.
[62]

Ziwang Zheng, Tangli Jiao, Zebang Cheng, et al. 2023. EmoLLaVA: A Large Vision-Language Model for Emotion Recognition. arXiv preprint arXiv:2312.14415 (2023).
[63]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2024. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In International Conference on Learning Representations.
[64]

Yijie Zhu, Yibo Lyu, Zitong Yu, Rui Shao, Kaiyang Zhou, and Liqiang Nie. 2025. EmoSym: A Symbiotic Framework for Unified Emotional Understanding and Generation via Latent Reasoning. In Proceedings of the 33rd ACM International Conference on Multimedia.
[65]

Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, and Liqiang Nie. 2025. UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries. arXiv preprint arXiv:2507.23372 (2025).
