Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 17:38 UTC · model grok-4.3
The pith
A 2.2-billion-parameter multimodal model unifies six emotional tasks, from raw perception to empathetic interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nano-EmoX is presented as the first compact multimodal language model (2.2B parameters) to unify six core affective tasks across the full three-level hierarchy of perception, understanding, and interaction. It achieves this by combining enhanced facial and fusion encoders, projecting their outputs into a shared language space through heterogeneous adapters, and training the system with the P2E curriculum, which progressively links rapid perception to chain-of-thought-driven empathy. The result is state-of-the-art or highly competitive performance on multiple benchmarks, together with clear efficiency and generalization gains.
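To make the pipeline concrete, here is a minimal sketch, not the authors' released code, of how heterogeneous adapters could project per-modality encoder outputs into a shared language-embedding space. The dimensions, modality names, and two-layer GELU adapter shape are all assumptions.

```python
# Minimal sketch, not the authors' code: heterogeneous adapters projecting
# per-modality encoder outputs into a shared language-embedding space.
# All dimensions and modality names below are assumptions.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Two-layer MLP mapping one modality's features to the LM hidden size."""
    def __init__(self, in_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # (batch, tokens, lm_dim)

lm_dim = 2048  # hypothetical hidden size of the lightweight LM
adapters = nn.ModuleDict({
    "face":   ModalityAdapter(in_dim=512, lm_dim=lm_dim),   # enhanced facial encoder
    "fusion": ModalityAdapter(in_dim=1024, lm_dim=lm_dim),  # fusion encoder
    "audio":  ModalityAdapter(in_dim=768, lm_dim=lm_dim),   # speech encoder features
})

def inject(features: dict) -> torch.Tensor:
    """Concatenate projected modality tokens for the LM to attend over."""
    return torch.cat([adapters[m](x) for m, x in features.items()], dim=1)
```

The "heterogeneous" part is simply that each modality gets its own adapter with its own input width; the shared output width is what lets a single lightweight LM consume all of them.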
What carries the argument
A cognitively inspired three-level hierarchy (perception, understanding, interaction) that organizes affective tasks by cognitive depth and directs both the model architecture and the P2E progressive training curriculum.
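One illustrative way to encode that hierarchy as plain data driving both task routing and curriculum ordering is sketched below; the task names are placeholders rather than the paper's confirmed six tasks.

```python
# Illustrative only: the three-level hierarchy as data that could drive both
# task routing and curriculum ordering. Task names are placeholders.
HIERARCHY = {
    1: ("perception",    ["facial_expression_recognition", "speech_emotion_recognition"]),
    2: ("understanding", ["multimodal_emotion_reasoning", "multimodal_sentiment_analysis"]),
    3: ("interaction",   ["empathetic_response_generation", "emotional_support_dialogue"]),
}

def curriculum_order():
    """Yield levels shallow-to-deep, the ordering P2E is described as following."""
    for level in sorted(HIERARCHY):
        name, tasks = HIERARCHY[level]
        yield level, name, tasks
```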
If this is right
- A single compact model can now perform perception, understanding, and interaction tasks without the fragmentation that previously required separate specialized systems.
- Progressive curriculum alignment from perceptual encoders to chain-of-thought reasoning improves transfer across emotional benchmarks.
- Heterogeneous adapters allow omni-modal cues to be injected into a lightweight language model while keeping total size at 2.2B parameters.
- The same hierarchy can be reused to add new affective tasks without retraining the entire model from scratch.
Where Pith is reading between the lines
- The hierarchy may generalize beyond emotion to other cognitive domains where low-level sensory input must be lifted to deliberative interaction.
- Because the model stays small, the approach could enable on-device emotional intelligence in robots or mobile apps where larger models are impractical.
- Future work could test whether inserting explicit uncertainty estimates at each hierarchy level further improves robustness on ambiguous emotional inputs.
Load-bearing premise
The three-level cognitive hierarchy accurately captures how emotional intelligence builds from perception to empathy and therefore supplies an effective blueprint for model design and training that produces cross-task transfer.
What would settle it
Train an otherwise identical 2.2B model on the same six tasks but without the hierarchy-guided architecture or P2E curriculum and measure whether cross-task generalization and benchmark scores drop substantially below the reported Nano-EmoX results.
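A hedged sketch of that test, with hypothetical `train` and `evaluate` callables standing in for the actual pipeline:

```python
# Hedged sketch of the proposed falsification test. `train` and `evaluate`
# are hypothetical callables standing in for the real pipeline; encoders,
# fusion module, and adapters are assumed frozen identically in all arms.
def run_ablation(train, evaluate, tasks, seeds=(0, 1, 2)):
    conditions = ("p2e_curriculum", "simultaneous", "random_order")
    results = {cond: [] for cond in conditions}
    for seed in seeds:
        for cond in conditions:
            # Same 2.2B backbone and data in every arm; only the task
            # schedule differs (staged hierarchy vs. mixed vs. shuffled).
            model = train(tasks, schedule=cond, seed=seed)
            results[cond].append(evaluate(model, tasks))  # task -> score dict
    return results  # report per-task mean and spread across seeds
```

Under this protocol, attribution to the curriculum requires the staged condition to beat both controls by more than the seed-to-seed spread.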
Original abstract
The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization. The code is available at https://github.com/waHAHJIAHAO/Nano-EmoX.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Nano-EmoX, a 2.2B-parameter compact multimodal language model that unifies six core affective tasks across a cognitively inspired three-level hierarchy (perception, understanding, interaction). It introduces omni-modal encoders (including an enhanced facial encoder and a fusion encoder), heterogeneous adapters that project into the language space, and the P2E curriculum-based training framework that progressively aligns perception with chain-of-thought empathy. The authors claim this is the first such compact MLM, achieving SOTA or highly competitive results across multiple benchmarks with good efficiency and generalization.
Significance. If the results hold and the hierarchy's contribution can be isolated, the work would be significant as an efficient, unified approach to affective multimodal modeling that bridges low-level perception and high-level interaction in a single compact model. The public code release supports reproducibility and could enable follow-up work on emotional intelligence in resource-constrained settings.
major comments (2)
- [Experiments] Experiments section: The central claim attributes unification, cross-task transferability, and generalization to the three-level hierarchy plus P2E curriculum, yet no ablation is reported that removes the staged curriculum (e.g., simultaneous multitask training or random ordering) while freezing the omni-modal encoders, fusion encoder, and heterogeneous adapters. Without this isolation, performance gains on the six tasks cannot be attributed to the cognitively inspired structure rather than the underlying multimodal components.
- [Results] Results section: The abstract asserts SOTA or highly competitive performance across benchmarks, but the provided text supplies no quantitative metrics, specific benchmark names, error bars, baseline comparisons, or ablation tables. These details are load-bearing for substantiating the efficiency and generalization claims and must be presented with full tables and statistical analysis.
minor comments (2)
- [Abstract] Abstract: The phrase 'to the best of our knowledge' is standard but could be strengthened by briefly noting the exact six tasks and the three hierarchy levels for immediate clarity.
- [Method] Notation: The description of 'heterogeneous adapters' and 'omni-modal encoders' would benefit from a single diagram or table summarizing input modalities and projection dimensions to aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the presentation of our contributions. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting our work.
Point-by-point responses
Referee: [Experiments] Experiments section: The central claim attributes unification, cross-task transferability, and generalization to the three-level hierarchy plus P2E curriculum, yet no ablation is reported that removes the staged curriculum (e.g., simultaneous multitask training or random ordering) while freezing the omni-modal encoders, fusion encoder, and heterogeneous adapters. Without this isolation, performance gains on the six tasks cannot be attributed to the cognitively inspired structure rather than the underlying multimodal components.
Authors: We agree that the manuscript would benefit from an explicit ablation isolating the P2E curriculum's contribution. The current version does not include a direct comparison of staged curriculum training against simultaneous multitask training or random ordering with encoders and adapters held fixed. In the revised manuscript, we will add this ablation experiment to the Experiments section, reporting performance deltas on the six tasks and discussing how the results support attribution to the curriculum structure. (revision: yes)
Referee: [Results] Results section: The abstract asserts SOTA or highly competitive performance across benchmarks, but the provided text supplies no quantitative metrics, specific benchmark names, error bars, baseline comparisons, or ablation tables. These details are load-bearing for substantiating the efficiency and generalization claims and must be presented with full tables and statistical analysis.
Authors: We acknowledge that the version reviewed did not present the full quantitative details in a readily accessible form. The manuscript contains benchmark results, but to address this concern we will expand the Results section with complete tables listing all quantitative metrics, specific benchmark names for the six tasks, error bars, baseline comparisons, and existing ablation tables. We will also incorporate statistical analysis (e.g., significance tests) and revise the abstract to include key performance numbers supporting the SOTA/competitive claims. (revision: yes)
Circularity Check
No circularity: hierarchy and model presented as novel construction
Full rationale
The manuscript introduces a three-level hierarchy as a guiding conceptual framework and describes Nano-EmoX plus P2E curriculum as newly constructed components whose performance is evaluated on external benchmarks. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any claimed unification or generalization to a definitional equivalence or by-construction prediction. The central claims rest on architectural choices and empirical results rather than any step that collapses back to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- 2.2B model scale
axioms (1)
- Domain assumption: a cognitively inspired three-level hierarchy organizes affective tasks according to cognitive depth and supports unified modeling.
invented entities (1)
- P2E curriculum-based training framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth—perception, understanding, and interaction—and provides a unified conceptual foundation... P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat induction and embed_strictMono (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "Phase 1: Foundational Modality Alignment... Phase 2: Cross-modal Fusion Pre-training... Phase 3: Multitask Instruction Tuning... shallow-to-deep cognitive progression"
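Read literally, that progression implies a staged loop of roughly the following shape. The stage contents, freezing scheme, and hooks (`freeze_all`, `unfreeze`, `fit`) are invented for illustration; the paper's exact data mixtures and losses are not given here.

```python
# Sketch of the quoted shallow-to-deep progression. Stage contents, the
# freezing scheme, and the hooks (freeze_all, unfreeze, fit) are invented
# for illustration; the paper's exact data mixtures are not given here.
PHASES = [
    ("phase1_modality_alignment", {"trainable": ["adapters"],           "data": "paired caption data"}),
    ("phase2_crossmodal_fusion",  {"trainable": ["adapters", "fusion"], "data": "multimodal emotion data"}),
    ("phase3_instruction_tuning", {"trainable": ["adapters", "lm"],     "data": "six-task CoT instructions"}),
]

def run_p2e(model, fit):
    """Run the three phases in order; each phase starts from the previous one."""
    for name, cfg in PHASES:
        model.freeze_all()                # hypothetical hook
        model.unfreeze(cfg["trainable"])  # hypothetical hook
        fit(model, dataset=cfg["data"], phase=name)
    return model
```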
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.