Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 17:38 UTC · model grok-4.3
The pith
A 2.2-billion-parameter multimodal model unifies six emotional tasks, from raw perception to empathetic interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nano-EmoX is presented as the first compact multimodal language model (2.2B parameters) to unify six core affective tasks across the full three-level hierarchy of perception, understanding, and interaction. It achieves this by combining enhanced facial and fusion encoders, projecting their outputs into a shared language space through heterogeneous adapters, and training the system with the P2E curriculum, which progressively links rapid perception to chain-of-thought-driven empathy. The result is state-of-the-art or highly competitive performance on multiple benchmarks, together with clear efficiency and generalization gains.
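To make the pipeline concrete, here is a minimal sketch, not the authors' released code, of how heterogeneous adapters could project per-modality encoder outputs into a shared language-embedding space. The dimensions, modality names, and two-layer GELU adapter shape are all assumptions.

```python
# Minimal sketch, not the authors' code: heterogeneous adapters projecting
# per-modality encoder outputs into a shared language-embedding space.
# All dimensions and modality names below are assumptions.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Two-layer MLP mapping one modality's features to the LM hidden size."""
    def __init__(self, in_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # (batch, tokens, lm_dim)

lm_dim = 2048  # hypothetical hidden size of the lightweight LM
adapters = nn.ModuleDict({
    "face":   ModalityAdapter(in_dim=512, lm_dim=lm_dim),   # enhanced facial encoder
    "fusion": ModalityAdapter(in_dim=1024, lm_dim=lm_dim),  # fusion encoder
    "audio":  ModalityAdapter(in_dim=768, lm_dim=lm_dim),   # speech encoder features
})

def inject(features: dict) -> torch.Tensor:
    """Concatenate projected modality tokens for the LM to attend over."""
    return torch.cat([adapters[m](x) for m, x in features.items()], dim=1)
```

The "heterogeneous" part is simply that each modality gets its own adapter with its own input width; the shared output width is what lets a single lightweight LM consume all of them.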
What carries the argument
A cognitively inspired three-level hierarchy (perception, understanding, interaction) that organizes affective tasks by cognitive depth and directs both the model architecture and the P2E progressive training curriculum.
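One illustrative way to encode that hierarchy as plain data driving both task routing and curriculum ordering is sketched below; the task names are placeholders rather than the paper's confirmed six tasks.

```python
# Illustrative only: the three-level hierarchy as data that could drive both
# task routing and curriculum ordering. Task names are placeholders.
HIERARCHY = {
    1: ("perception",    ["facial_expression_recognition", "speech_emotion_recognition"]),
    2: ("understanding", ["multimodal_emotion_reasoning", "multimodal_sentiment_analysis"]),
    3: ("interaction",   ["empathetic_response_generation", "emotional_support_dialogue"]),
}

def curriculum_order():
    """Yield levels shallow-to-deep, the ordering P2E is described as following."""
    for level in sorted(HIERARCHY):
        name, tasks = HIERARCHY[level]
        yield level, name, tasks
```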
If this is right
- A single compact model can now perform perception, understanding, and interaction tasks without the fragmentation that previously required separate specialized systems.
- Progressive curriculum alignment from perceptual encoders to chain-of-thought reasoning improves transfer across emotional benchmarks.
- Heterogeneous adapters allow omni-modal cues to be injected into a lightweight language model while keeping total size at 2.2B parameters.
- The same hierarchy can be reused to add new affective tasks without retraining the entire model from scratch.
Where Pith is reading between the lines
- The hierarchy may generalize beyond emotion to other cognitive domains where low-level sensory input must be lifted to deliberative interaction.
- Because the model stays small, the approach could enable on-device emotional intelligence in robots or mobile apps where larger models are impractical.
- Future work could test whether inserting explicit uncertainty estimates at each hierarchy level further improves robustness on ambiguous emotional inputs.
Load-bearing premise
The three-level cognitive hierarchy accurately captures how emotional intelligence builds from perception to empathy and therefore supplies an effective blueprint for model design and training that produces cross-task transfer.
What would settle it
Train an otherwise identical 2.2B model on the same six tasks but without the hierarchy-guided architecture or P2E curriculum and measure whether cross-task generalization and benchmark scores drop substantially below the reported Nano-EmoX results.
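A hedged sketch of that test, with hypothetical `train` and `evaluate` callables standing in for the actual pipeline:

```python
# Hedged sketch of the proposed falsification test. `train` and `evaluate`
# are hypothetical callables standing in for the real pipeline; encoders,
# fusion module, and adapters are assumed frozen identically in all arms.
def run_ablation(train, evaluate, tasks, seeds=(0, 1, 2)):
    conditions = ("p2e_curriculum", "simultaneous", "random_order")
    results = {cond: [] for cond in conditions}
    for seed in seeds:
        for cond in conditions:
            # Same 2.2B backbone and data in every arm; only the task
            # schedule differs (staged hierarchy vs. mixed vs. shuffled).
            model = train(tasks, schedule=cond, seed=seed)
            results[cond].append(evaluate(model, tasks))  # task -> score dict
    return results  # report per-task mean and spread across seeds
```

Under this protocol, attribution to the curriculum requires the staged condition to beat both controls by more than the seed-to-seed spread.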
Original abstract
The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization. The code is available at https://github.com/waHAHJIAHAO/Nano-EmoX.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Nano-EmoX, a 2.2B-parameter compact multimodal language model that unifies six core affective tasks across a cognitively inspired three-level hierarchy (perception, understanding, interaction). It introduces omni-modal encoders (including an enhanced facial encoder and a fusion encoder), heterogeneous adapters that project into the language space, and the P2E curriculum-based training framework that progressively aligns perception with chain-of-thought empathy. The authors claim this is the first such compact MLM, achieving SOTA or highly competitive results across multiple benchmarks with good efficiency and generalization.
Significance. If the results hold and the hierarchy's contribution can be isolated, the work would be significant as an efficient, unified approach to affective multimodal modeling that bridges low-level perception and high-level interaction in a single compact model. The public code release supports reproducibility and could enable follow-up work on emotional intelligence in resource-constrained settings.
major comments (2)
- [Experiments] Experiments section: The central claim attributes unification, cross-task transferability, and generalization to the three-level hierarchy plus P2E curriculum, yet no ablation is reported that removes the staged curriculum (e.g., simultaneous multitask training or random ordering) while freezing the omni-modal encoders, fusion encoder, and heterogeneous adapters. Without this isolation, performance gains on the six tasks cannot be attributed to the cognitively inspired structure rather than the underlying multimodal components.
- [Results] Results section: The abstract asserts SOTA or highly competitive performance across benchmarks, but the provided text supplies no quantitative metrics, specific benchmark names, error bars, baseline comparisons, or ablation tables. These details are load-bearing for substantiating the efficiency and generalization claims and must be presented with full tables and statistical analysis.
minor comments (2)
- [Abstract] Abstract: The phrase 'to the best of our knowledge' is standard but could be strengthened by briefly noting the exact six tasks and the three hierarchy levels for immediate clarity.
- [Method] Notation: The description of 'heterogeneous adapters' and 'omni-modal encoders' would benefit from a single diagram or table summarizing input modalities and projection dimensions to aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the presentation of our contributions. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting our work.
Point-by-point responses
Referee: [Experiments] Experiments section: The central claim attributes unification, cross-task transferability, and generalization to the three-level hierarchy plus P2E curriculum, yet no ablation is reported that removes the staged curriculum (e.g., simultaneous multitask training or random ordering) while freezing the omni-modal encoders, fusion encoder, and heterogeneous adapters. Without this isolation, performance gains on the six tasks cannot be attributed to the cognitively inspired structure rather than the underlying multimodal components.
Authors: We agree that the manuscript would benefit from an explicit ablation isolating the P2E curriculum's contribution. The current version does not include a direct comparison of staged curriculum training against simultaneous multitask training or random ordering with encoders and adapters held fixed. In the revised manuscript, we will add this ablation experiment to the Experiments section, reporting performance deltas on the six tasks and discussing how the results support attribution to the curriculum structure. (revision: yes)
Referee: [Results] Results section: The abstract asserts SOTA or highly competitive performance across benchmarks, but the provided text supplies no quantitative metrics, specific benchmark names, error bars, baseline comparisons, or ablation tables. These details are load-bearing for substantiating the efficiency and generalization claims and must be presented with full tables and statistical analysis.
Authors: We acknowledge that the version reviewed did not present the full quantitative details in a readily accessible form. The manuscript contains benchmark results, but to address this concern we will expand the Results section with complete tables listing all quantitative metrics, specific benchmark names for the six tasks, error bars, baseline comparisons, and existing ablation tables. We will also incorporate statistical analysis (e.g., significance tests) and revise the abstract to include key performance numbers supporting the SOTA/competitive claims. (revision: yes)
Circularity Check
No circularity: hierarchy and model presented as novel construction
Full rationale
The manuscript introduces a three-level hierarchy as a guiding conceptual framework and describes Nano-EmoX plus P2E curriculum as newly constructed components whose performance is evaluated on external benchmarks. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any claimed unification or generalization to a definitional equivalence or by-construction prediction. The central claims rest on architectural choices and empirical results rather than any step that collapses back to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- 2.2B model scale
axioms (1)
- Domain assumption: a cognitively inspired three-level hierarchy organizes affective tasks according to cognitive depth and supports unified modeling.
invented entities (1)
- P2E curriculum-based training framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth—perception, understanding, and interaction—and provides a unified conceptual foundation... P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat induction and embed_strictMono (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "Phase 1: Foundational Modality Alignment... Phase 2: Cross-modal Fusion Pre-training... Phase 3: Multitask Instruction Tuning... shallow-to-deep cognitive progression"
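Read literally, that progression implies a staged loop of roughly the following shape. The stage contents, freezing scheme, and hooks (`freeze_all`, `unfreeze`, `fit`) are invented for illustration; the paper's exact data mixtures and losses are not given here.

```python
# Sketch of the quoted shallow-to-deep progression. Stage contents, the
# freezing scheme, and the hooks (freeze_all, unfreeze, fit) are invented
# for illustration; the paper's exact data mixtures are not given here.
PHASES = [
    ("phase1_modality_alignment", {"trainable": ["adapters"],           "data": "paired caption data"}),
    ("phase2_crossmodal_fusion",  {"trainable": ["adapters", "fusion"], "data": "multimodal emotion data"}),
    ("phase3_instruction_tuning", {"trainable": ["adapters", "lm"],     "data": "six-task CoT instructions"}),
]

def run_p2e(model, fit):
    """Run the three phases in order; each phase starts from the previous one."""
    for name, cfg in PHASES:
        model.freeze_all()                # hypothetical hook
        model.unfreeze(cfg["trainable"])  # hypothetical hook
        fit(model, dataset=cfg["data"], phase=name)
    return model
```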
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.