MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

Beier Zhu; Chengwei Qin; Dongchuan Ran; Hewei Guo; Jiangnan Chen; Jiayu Zhang; Lewei Lu; Wenwen Tong; Xinyi Wang; Xun Yang

arxiv: 2606.27652 · v1 · pith:TECFUADYnew · submitted 2026-06-26 · 💻 cs.AI

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

Zhiyuan Han , Beier Zhu , Wenwen Tong , Chengwei Qin , Xinyi Wang , Jiayu Zhang , Jiangnan Chen , Hewei Guo

show 3 more authors

Dongchuan Ran Lewei Lu Xun Yang

This is my paper

Pith reviewed 2026-06-29 05:04 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal emotion recognitionslow-fast thinkingreinforcement learningdual-objective optimizationconfidence calibrationreasoning modelsemotion classification

0 comments

The pith

MER-R1 uses reinforcement learning to combine fast and slow thinking so explicit reasoning improves multimodal emotion recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that explicit reasoning in multimodal models often fails to raise emotion recognition accuracy even when it increases interpretability. Fast thinking produces broader, higher-recall predictions while slow thinking produces more selective, higher-precision ones. MER-R1 turns this observed complementarity into a single training procedure by disentangling the two signals into separate optimization objectives and then calibrating the slow answer against the fast intuition. The result is joint gains rather than a trade-off, with state-of-the-art numbers on the MER-UniBench and MME-Emotion benchmarks. A reader would care because the work shows how to make deliberative reasoning an active contributor to accuracy instead of an optional post-hoc explanation.

Core claim

MER-R1 is a reinforcement learning framework that unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. It does so through dual-objective disentanglement that treats recall and precision as independent optimization signals and through slow-fast confidence calibration that strengthens correct emotion categories while suppressing incorrect ones. The paper supplies a theoretical argument that this synergy reduces variance-induced interference during optimization.

What carries the argument

Dual-objective disentanglement that separates recall and precision into independent optimization signals, paired with slow-fast confidence calibration that aligns the final slow-thinking answer with fast-thinking intuition.

If this is right

Recall and precision can be optimized together rather than traded off during training.
The calibrated slow-thinking output strengthens correct emotions and weakens incorrect ones.
Reasoning steps contribute directly to higher accuracy on emotion recognition tasks.
The same slow-fast structure yields state-of-the-art results on both MER-UniBench and MME-Emotion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disentanglement technique might be tested on other multimodal tasks that mix intuitive and deliberative reasoning.
Variance reduction during optimization could be measured directly to check whether it explains the observed gains.
The framework could be applied to non-emotion domains such as visual question answering to see if the recall-precision separation generalizes.

Load-bearing premise

Separating recall and precision into independent optimization signals allows joint improvement without hidden trade-offs between them.

What would settle it

An experiment in which the dual-objective version on MER-UniBench shows lower recall, lower precision, or higher variance than single-objective fast or slow baselines would falsify the claimed synergy.

Figures

Figures reproduced from arXiv: 2606.27652 by Beier Zhu, Chengwei Qin, Dongchuan Ran, Hewei Guo, Jiangnan Chen, Jiayu Zhang, Lewei Lu, Wenwen Tong, Xinyi Wang, Xun Yang, Zhiyuan Han.

**Figure 1.** Figure 1: Thinking paradox and slow-fast synergy in MER. (a) Motivating example: slow thinking is cautious but may miss valid emotions, whereas fast thinking improves coverage but introduces noise. Slow-fast synergy combines their complementary strengths. (b) Evaluation summary on 9 datasets: Fast thinking counter-intuitively outperforms slow thinking. To understand this paradox, we analyze fast and slow thinking in… view at source ↗

**Figure 2.** Figure 2: Emotion wheel. Given a multimodal input x, the policy πθ generates an output y = (y cot, yans), where y cot denotes the chain-of-thought reasoning trace and y ans denotes the final emotion answer. Slow thinking follows the standard think-then-answer format, while fast thinking is elicited with an answer-only prompt that directly produces y ans without reasoning, i.e., y cot = ∅. Since both the model answe… view at source ↗

**Figure 3.** Figure 3: Analyses of the thinking paradox on five MER benchmarks. (a) Performance gaps in recall (R), precision (P), and F1 (F), where ∆M = Mfast − Mslow for M ∈ {R, P, F}. (b) Prediction diversity by the average number of emotion categories. (c) Confidence on ground-truth emotion categories. (d) Confidence margin between ground-truth and hard-negative emotion categories. results show a clear difference in predicti… view at source ↗

**Figure 4.** Figure 4: Framework of MER-R1. Our MER-R1 consists of dual-objective disentanglement for Desideratum I and slow-fast confidence calibration for Desideratum II. 3.4 Desiderata for slow-fast thinking synergy These observations suggest that an effective emotion reasoning model should not simply prefer one thinking mode over the other. Instead, it should satisfy two desiderata. • Desideratum I: At the prediction level, … view at source ↗

**Figure 5.** Figure 5: Effect of slow-fast thinking synergy (a) Recall. (b) Precision. (c) Confidence on ground-truth categories. (d) Confidence margin. MER-R1 improves recall, preserves precision, boosts confidence on ground-truth categories, and strengthens confidence discrimination. strongest reproduced baseline by 5.63 points. The improvement is consistent across fine-grained emotion recognition, basic emotion recognition, a… view at source ↗

**Figure 6.** Figure 6: Emotion wheels used in MER-UniBench evaluation. We adopt five emotion wheels (W1–W5) to map free-form emotion words into level-1 emotion categories. For the k-th wheel wk, the complete mapping function is defined as: Gwk (·) = F wk l3 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 8.** Figure 8: Retained hitrate comparision. tasks cover both controlled and in-the-wild scenarios, and evaluate different aspects of affective understanding, including categorical emotion recognition, fine-grained affective interpretation, sentiment polarity, and intent inference. Evaluation metrics. MME-Emotion reports three metrics. Recognition measures whether the final predicted emotion or affective label is correc… view at source ↗

read the original abstract

We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by triggering direct answers often outperforms slow thinking after deliberative reasoning. Our empirical analyses show that fast thinking improves recall with broader and more confident predictions, whereas slow thinking favors precision through conservative filtering of incorrect categories. Building on these insights, we propose MER-R1, a reinforcement learning framework that turns slow-fast complementarity into explicit optimization. Dual-objective disentanglement separates recall and precision into two optimization signals, allowing them to be jointly optimized rather than traded off against each other. Slow-fast confidence calibration further aligns the final slow-thinking answer with fast-thinking intuition, strengthening correct emotions while suppressing incorrect ones. In this way, MER-R1 unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. We further provide theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization. Extensive experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves state-of-the-art performance and makes reasoning genuinely benefit emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MER-R1 turns an empirical observation about fast vs slow thinking in emotion recognition into a concrete RL framework with dual objectives and calibration, but the theory claim stays thin.

read the letter

The main thing to know is that the authors noticed fast direct answers beat deliberative reasoning on multimodal emotion recognition accuracy, then built MER-R1 to optimize both recall and precision together via disentangled objectives plus a calibration step that aligns the slow output with fast intuition.

What is new is the specific combination of dual-objective disentanglement and slow-fast confidence calibration inside an RL loop for this task. They treat the complementarity as something to exploit rather than a trade-off, and the experiments report SOTA numbers on MER-UniBench and MME-Emotion. That empirical starting point is straightforward and the method is described clearly enough to replicate in principle.

The soft spot is the theoretical justification. The abstract states that the synergy mitigates variance-induced interference, but supplies no derivation, loss formulation, or independence assumptions to check. Without those details it is difficult to tell whether the argument stands on its own or simply restates the desired outcome. If the full paper has a clean optimization model, that would fix it; otherwise the theory remains an assertion.

This is for researchers already working on multimodal emotion recognition or applied reasoning in MLLMs. A reader focused on that narrow area would get a usable recipe and the fast/slow analysis. It is not positioned as a general reasoning advance.

I would send it to peer review. The empirical hook is honest and the framework is specific enough to evaluate, even if the theory section will probably need expansion.

Referee Report

1 major / 1 minor

Summary. The paper claims that explicit reasoning in multimodal emotion recognition (MER) does not always improve accuracy, with fast thinking favoring recall and slow thinking favoring precision; it introduces MER-R1, an RL framework using dual-objective disentanglement to jointly optimize recall/precision signals, slow-fast confidence calibration, and a theoretical argument that this synergy mitigates variance-induced interference during optimization, achieving SOTA results on MER-UniBench and MME-Emotion while making reasoning beneficial for the task.

Significance. If the theoretical justification holds without circularity and the empirical results are robust, the work could advance multimodal reasoning by showing a concrete mechanism to convert complementary thinking modes into joint optimization gains rather than trade-offs, with potential implications for other reasoning-heavy vision-language tasks.

major comments (1)

[Abstract] Abstract: the central claim rests on 'theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization' via dual-objective disentanglement, yet no derivation, explicit loss formulation, variance term, or independence assumptions are supplied; without these the argument cannot be checked for non-circularity or whether the claimed separation of recall/precision signals follows from the model rather than being presupposed.

minor comments (1)

The abstract asserts SOTA performance and that 'reasoning genuinely benefit emotion recognition' but supplies no quantitative deltas, baseline comparisons, or statistical details to allow immediate assessment of effect size.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that the abstract's brevity makes the theoretical claim difficult to verify and will revise it to include a concise outline of the key derivation, loss terms, and assumptions.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim rests on 'theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization' via dual-objective disentanglement, yet no derivation, explicit loss formulation, variance term, or independence assumptions are supplied; without these the argument cannot be checked for non-circularity or whether the claimed separation of recall/precision signals follows from the model rather than being presupposed.

Authors: We acknowledge the validity of this observation regarding the abstract. The full manuscript contains the explicit dual-objective loss formulation (Section 3.2), the variance term derivation in Theorem 1, and the independence assumptions in the accompanying proof. These elements demonstrate that the recall/precision separation arises from the disentangled RL objectives rather than being presupposed, and that the synergy reduces variance-induced interference without circularity. To address the referee's concern, we will expand the abstract with a brief summary of these components so the central claim can be evaluated directly from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and provided text describe empirical observations on fast vs. slow thinking, introduce dual-objective disentanglement and slow-fast calibration as optimization mechanisms, and assert a theoretical justification for variance mitigation without supplying equations, self-citations, or derivations. No load-bearing step reduces by construction to fitted inputs, self-definitions, or author-overlapping citations; the claimed synergy is presented as following from stated empirical patterns and external experiments on MER-UniBench/MME-Emotion rather than presupposed by the framework itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or background assumptions to audit; ledger left empty.

pith-pipeline@v0.9.1-grok · 5769 in / 1091 out tokens · 19654 ms · 2026-06-29T05:04:00.057501+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 13 canonical work pages · 8 internal anchors

[1]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

MIT press, 2000

Rosalind W Picard.Affective computing. MIT press, 2000

2000
[5]

Mer 2025: When affective computing meets large language models

Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, et al. Mer 2025: When affective computing meets large language models. InACM MM, 2025

2025
[6]

Multimodal large language models meet multimodal emotion recognition and reasoning: A survey.arXiv preprint arXiv:2509.24322,

Yuntao Shou, Tao Meng, Wei Ai, and Keqin Li. Multimodal large language models meet multimodal emotion recognition and reasoning: A survey.arXiv preprint arXiv:2509.24322, 2025

work page arXiv 2025
[7]

Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mngyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. InACM MM, 2023

2023
[8]

Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition

Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, Jinming Zhao, Ziyang Ma, Xie Chen, et al. Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. InProceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

2024
[9]

Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InACL, 2019

2019
[10]

Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

2008
[11]

Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models

Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, et al. Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. InICML, 2025

2025
[12]

Ov-mer: Towards open-vocabulary multimodal emotion recognition

Zheng Lian, Haiyang Sun, Licai Sun, Haoyu Chen, Lan Chen, Hao Gu, Zhuofan Wen, Shun Chen, Siyuan Zhang, Hailiang Yao, et al. Ov-mer: Towards open-vocabulary multimodal emotion recognition. InICML, 2025

2025
[13]

Explainable multimodal emotion reasoning.CoRR, 2023

Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. Explainable multimodal emotion reasoning.CoRR, 2023

2023
[14]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025
[15]

Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

Zheng Lian, Fan Zhang, Yazhou Zhang, Jianhua Tao, Rui Liu, Haoyu Chen, and Xiaobai Li. Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

work page arXiv 2025
[16]

Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025

Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025. 10

work page arXiv 2025
[17]

From system 1 to system 2: A survey of reasoning large language models.TPAMI, 2026

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.TPAMI, 2026

2026
[18]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Visual instruction tuning.NeurIPS, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 2024

2024
[21]

Parallel diffusion solver via residual dirichlet policy optimization.IEEE TPAMI, pages 1–17, 2026

Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, and Chi Zhang. Parallel diffusion solver via residual dirichlet policy optimization.IEEE TPAMI, pages 1–17, 2026

2026
[22]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Perception-aware policy optimization for multimodal reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. InICLR, 2026

2026
[24]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InICCV, 2025

2025
[25]

Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InNeurIPS, 2025

2025
[26]

Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

work page arXiv 2026
[27]

Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning.NeurIPS, 2024

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning.NeurIPS, 2024

2024
[28]

Benchmarking and bridging emotion conflicts for multimodal emotion reasoning

Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, and Xun Yang. Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. InACM MM, 2025

2025
[29]

Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models

Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, et al. Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models. InICLR, 2026

2026
[30]

R1-omni: Explainable omni-multimodal emotion recognition with reinforcing learning.arXiv preprint arXiv:2503.05379, 2025

Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcing learning.arXiv preprint arXiv:2503.05379, 2025

work page arXiv 2025
[31]

Emotion-coherent reasoning for multimodal llms via emotional rationale verifier

Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, and Yong Man Ro. Emotion-coherent reasoning for multimodal llms via emotional rationale verifier. InAAAI, 2026

2026
[32]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Salmonn: Towards generic hearing abilities for large language models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. In ICLR, 2024

2024
[34]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024

2024
[35]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024

2024
[36]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024

2024
[37]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Pandagpt: One model to instruction-follow them all

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. InTLLM, 2023

2023
[39]

Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. InCoRR, 2016

2016
[40]

Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. InACL, 2018

2018
[41]

Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality

Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. InACL, 2020

2020
[42]

Make acoustic and visual cues matter: Ch-sims v2

Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, and Kai Gao. Make acoustic and visual cues matter: Ch-sims v2. 0 dataset and av-mixup consistent module. InICMI, 2022

2022
[43]

Unsupervised visual chain-of-thought reasoning via preference optimization

Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via preference optimization. InICCV, 2025

2025
[44]

I’ll tell you, it’s not easy for a woman who has divorced and is raising a child to find a partner

Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang, and Xiangnan He. Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation. InICLR, 2026. 12 Appendix A Theoretical analysis of dual-objective disentanglement We provide a theoretical analysis of the dual-objective dise...

2026

[1] [1]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

MIT press, 2000

Rosalind W Picard.Affective computing. MIT press, 2000

2000

[5] [5]

Mer 2025: When affective computing meets large language models

Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, et al. Mer 2025: When affective computing meets large language models. InACM MM, 2025

2025

[6] [6]

Multimodal large language models meet multimodal emotion recognition and reasoning: A survey.arXiv preprint arXiv:2509.24322,

Yuntao Shou, Tao Meng, Wei Ai, and Keqin Li. Multimodal large language models meet multimodal emotion recognition and reasoning: A survey.arXiv preprint arXiv:2509.24322, 2025

work page arXiv 2025

[7] [7]

Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mngyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. InACM MM, 2023

2023

[8] [8]

Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition

Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, Jinming Zhao, Ziyang Ma, Xie Chen, et al. Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. InProceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

2024

[9] [9]

Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InACL, 2019

2019

[10] [10]

Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

2008

[11] [11]

Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models

Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, et al. Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. InICML, 2025

2025

[12] [12]

Ov-mer: Towards open-vocabulary multimodal emotion recognition

Zheng Lian, Haiyang Sun, Licai Sun, Haoyu Chen, Lan Chen, Hao Gu, Zhuofan Wen, Shun Chen, Siyuan Zhang, Hailiang Yao, et al. Ov-mer: Towards open-vocabulary multimodal emotion recognition. InICML, 2025

2025

[13] [13]

Explainable multimodal emotion reasoning.CoRR, 2023

Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. Explainable multimodal emotion reasoning.CoRR, 2023

2023

[14] [14]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025

[15] [15]

Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

Zheng Lian, Fan Zhang, Yazhou Zhang, Jianhua Tao, Rui Liu, Haoyu Chen, and Xiaobai Li. Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

work page arXiv 2025

[16] [16]

Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025

Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025. 10

work page arXiv 2025

[17] [17]

From system 1 to system 2: A survey of reasoning large language models.TPAMI, 2026

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.TPAMI, 2026

2026

[18] [18]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Visual instruction tuning.NeurIPS, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 2024

2024

[21] [21]

Parallel diffusion solver via residual dirichlet policy optimization.IEEE TPAMI, pages 1–17, 2026

Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, and Chi Zhang. Parallel diffusion solver via residual dirichlet policy optimization.IEEE TPAMI, pages 1–17, 2026

2026

[22] [22]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Perception-aware policy optimization for multimodal reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. InICLR, 2026

2026

[24] [24]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InICCV, 2025

2025

[25] [25]

Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InNeurIPS, 2025

2025

[26] [26]

Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

work page arXiv 2026

[27] [27]

Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning.NeurIPS, 2024

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning.NeurIPS, 2024

2024

[28] [28]

Benchmarking and bridging emotion conflicts for multimodal emotion reasoning

Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, and Xun Yang. Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. InACM MM, 2025

2025

[29] [29]

Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models

Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, et al. Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models. InICLR, 2026

2026

[30] [30]

R1-omni: Explainable omni-multimodal emotion recognition with reinforcing learning.arXiv preprint arXiv:2503.05379, 2025

Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcing learning.arXiv preprint arXiv:2503.05379, 2025

work page arXiv 2025

[31] [31]

Emotion-coherent reasoning for multimodal llms via emotional rationale verifier

Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, and Yong Man Ro. Emotion-coherent reasoning for multimodal llms via emotional rationale verifier. InAAAI, 2026

2026

[32] [32]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Salmonn: Towards generic hearing abilities for large language models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. In ICLR, 2024

2024

[34] [34]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024

2024

[35] [35]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024

2024

[36] [36]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024

2024

[37] [37]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Pandagpt: One model to instruction-follow them all

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. InTLLM, 2023

2023

[39] [39]

Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. InCoRR, 2016

2016

[40] [40]

Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. InACL, 2018

2018

[41] [41]

Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality

Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. InACL, 2020

2020

[42] [42]

Make acoustic and visual cues matter: Ch-sims v2

Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, and Kai Gao. Make acoustic and visual cues matter: Ch-sims v2. 0 dataset and av-mixup consistent module. InICMI, 2022

2022

[43] [43]

Unsupervised visual chain-of-thought reasoning via preference optimization

Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via preference optimization. InICCV, 2025

2025

[44] [44]

I’ll tell you, it’s not easy for a woman who has divorced and is raising a child to find a partner

Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang, and Xiangnan He. Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation. InICLR, 2026. 12 Appendix A Theoretical analysis of dual-objective disentanglement We provide a theoretical analysis of the dual-objective dise...

2026