pith. sign in

arxiv: 2606.27652 · v1 · pith:TECFUADYnew · submitted 2026-06-26 · 💻 cs.AI

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

Pith reviewed 2026-06-29 05:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal emotion recognitionslow-fast thinkingreinforcement learningdual-objective optimizationconfidence calibrationreasoning modelsemotion classification
0
0 comments X

The pith

MER-R1 uses reinforcement learning to combine fast and slow thinking so explicit reasoning improves multimodal emotion recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that explicit reasoning in multimodal models often fails to raise emotion recognition accuracy even when it increases interpretability. Fast thinking produces broader, higher-recall predictions while slow thinking produces more selective, higher-precision ones. MER-R1 turns this observed complementarity into a single training procedure by disentangling the two signals into separate optimization objectives and then calibrating the slow answer against the fast intuition. The result is joint gains rather than a trade-off, with state-of-the-art numbers on the MER-UniBench and MME-Emotion benchmarks. A reader would care because the work shows how to make deliberative reasoning an active contributor to accuracy instead of an optional post-hoc explanation.

Core claim

MER-R1 is a reinforcement learning framework that unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. It does so through dual-objective disentanglement that treats recall and precision as independent optimization signals and through slow-fast confidence calibration that strengthens correct emotion categories while suppressing incorrect ones. The paper supplies a theoretical argument that this synergy reduces variance-induced interference during optimization.

What carries the argument

Dual-objective disentanglement that separates recall and precision into independent optimization signals, paired with slow-fast confidence calibration that aligns the final slow-thinking answer with fast-thinking intuition.

If this is right

  • Recall and precision can be optimized together rather than traded off during training.
  • The calibrated slow-thinking output strengthens correct emotions and weakens incorrect ones.
  • Reasoning steps contribute directly to higher accuracy on emotion recognition tasks.
  • The same slow-fast structure yields state-of-the-art results on both MER-UniBench and MME-Emotion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disentanglement technique might be tested on other multimodal tasks that mix intuitive and deliberative reasoning.
  • Variance reduction during optimization could be measured directly to check whether it explains the observed gains.
  • The framework could be applied to non-emotion domains such as visual question answering to see if the recall-precision separation generalizes.

Load-bearing premise

Separating recall and precision into independent optimization signals allows joint improvement without hidden trade-offs between them.

What would settle it

An experiment in which the dual-objective version on MER-UniBench shows lower recall, lower precision, or higher variance than single-objective fast or slow baselines would falsify the claimed synergy.

Figures

Figures reproduced from arXiv: 2606.27652 by Beier Zhu, Chengwei Qin, Dongchuan Ran, Hewei Guo, Jiangnan Chen, Jiayu Zhang, Lewei Lu, Wenwen Tong, Xinyi Wang, Xun Yang, Zhiyuan Han.

Figure 1
Figure 1. Figure 1: Thinking paradox and slow-fast synergy in MER. (a) Motivating example: slow thinking is cautious but may miss valid emotions, whereas fast thinking improves coverage but introduces noise. Slow-fast synergy combines their complementary strengths. (b) Evaluation summary on 9 datasets: Fast thinking counter-intuitively outperforms slow thinking. To understand this paradox, we analyze fast and slow thinking in… view at source ↗
Figure 2
Figure 2. Figure 2: Emotion wheel. Given a multimodal input x, the policy πθ generates an output y = (y cot, yans), where y cot denotes the chain-of-thought rea￾soning trace and y ans denotes the final emotion answer. Slow thinking follows the standard think-then-answer format, while fast thinking is elicited with an answer-only prompt that directly produces y ans without reasoning, i.e., y cot = ∅. Since both the model answe… view at source ↗
Figure 3
Figure 3. Figure 3: Analyses of the thinking paradox on five MER benchmarks. (a) Performance gaps in recall (R), precision (P), and F1 (F), where ∆M = Mfast − Mslow for M ∈ {R, P, F}. (b) Prediction diversity by the average number of emotion categories. (c) Confidence on ground-truth emotion categories. (d) Confidence margin between ground-truth and hard-negative emotion categories. results show a clear difference in predicti… view at source ↗
Figure 4
Figure 4. Figure 4: Framework of MER-R1. Our MER-R1 consists of dual-objective disentanglement for Desideratum I and slow-fast confidence calibration for Desideratum II. 3.4 Desiderata for slow-fast thinking synergy These observations suggest that an effective emotion reasoning model should not simply prefer one thinking mode over the other. Instead, it should satisfy two desiderata. • Desideratum I: At the prediction level, … view at source ↗
Figure 5
Figure 5. Figure 5: Effect of slow-fast thinking synergy (a) Recall. (b) Precision. (c) Confidence on ground-truth categories. (d) Confidence margin. MER-R1 improves recall, preserves precision, boosts confidence on ground-truth categories, and strengthens confidence discrimination. strongest reproduced baseline by 5.63 points. The improvement is consistent across fine-grained emotion recognition, basic emotion recognition, a… view at source ↗
Figure 6
Figure 6. Figure 6: Emotion wheels used in MER-UniBench evaluation. We adopt five emotion wheels (W1–W5) to map free-form emotion words into level-1 emotion categories. For the k-th wheel wk, the complete mapping function is defined as: Gwk (·) = F wk l3 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Retained hitrate comparision. tasks cover both controlled and in-the-wild scenarios, and evaluate different aspects of affective under￾standing, including categorical emotion recognition, fine-grained affective interpretation, sentiment polarity, and intent inference. Evaluation metrics. MME-Emotion reports three metrics. Recognition measures whether the final predicted emotion or affective label is correc… view at source ↗
read the original abstract

We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by triggering direct answers often outperforms slow thinking after deliberative reasoning. Our empirical analyses show that fast thinking improves recall with broader and more confident predictions, whereas slow thinking favors precision through conservative filtering of incorrect categories. Building on these insights, we propose MER-R1, a reinforcement learning framework that turns slow-fast complementarity into explicit optimization. Dual-objective disentanglement separates recall and precision into two optimization signals, allowing them to be jointly optimized rather than traded off against each other. Slow-fast confidence calibration further aligns the final slow-thinking answer with fast-thinking intuition, strengthening correct emotions while suppressing incorrect ones. In this way, MER-R1 unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. We further provide theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization. Extensive experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves state-of-the-art performance and makes reasoning genuinely benefit emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that explicit reasoning in multimodal emotion recognition (MER) does not always improve accuracy, with fast thinking favoring recall and slow thinking favoring precision; it introduces MER-R1, an RL framework using dual-objective disentanglement to jointly optimize recall/precision signals, slow-fast confidence calibration, and a theoretical argument that this synergy mitigates variance-induced interference during optimization, achieving SOTA results on MER-UniBench and MME-Emotion while making reasoning beneficial for the task.

Significance. If the theoretical justification holds without circularity and the empirical results are robust, the work could advance multimodal reasoning by showing a concrete mechanism to convert complementary thinking modes into joint optimization gains rather than trade-offs, with potential implications for other reasoning-heavy vision-language tasks.

major comments (1)
  1. [Abstract] Abstract: the central claim rests on 'theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization' via dual-objective disentanglement, yet no derivation, explicit loss formulation, variance term, or independence assumptions are supplied; without these the argument cannot be checked for non-circularity or whether the claimed separation of recall/precision signals follows from the model rather than being presupposed.
minor comments (1)
  1. The abstract asserts SOTA performance and that 'reasoning genuinely benefit emotion recognition' but supplies no quantitative deltas, baseline comparisons, or statistical details to allow immediate assessment of effect size.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that the abstract's brevity makes the theoretical claim difficult to verify and will revise it to include a concise outline of the key derivation, loss terms, and assumptions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim rests on 'theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization' via dual-objective disentanglement, yet no derivation, explicit loss formulation, variance term, or independence assumptions are supplied; without these the argument cannot be checked for non-circularity or whether the claimed separation of recall/precision signals follows from the model rather than being presupposed.

    Authors: We acknowledge the validity of this observation regarding the abstract. The full manuscript contains the explicit dual-objective loss formulation (Section 3.2), the variance term derivation in Theorem 1, and the independence assumptions in the accompanying proof. These elements demonstrate that the recall/precision separation arises from the disentangled RL objectives rather than being presupposed, and that the synergy reduces variance-induced interference without circularity. To address the referee's concern, we will expand the abstract with a brief summary of these components so the central claim can be evaluated directly from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and provided text describe empirical observations on fast vs. slow thinking, introduce dual-objective disentanglement and slow-fast calibration as optimization mechanisms, and assert a theoretical justification for variance mitigation without supplying equations, self-citations, or derivations. No load-bearing step reduces by construction to fitted inputs, self-definitions, or author-overlapping citations; the claimed synergy is presented as following from stated empirical patterns and external experiments on MER-UniBench/MME-Emotion rather than presupposed by the framework itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or background assumptions to audit; ledger left empty.

pith-pipeline@v0.9.1-grok · 5769 in / 1091 out tokens · 19654 ms · 2026-06-29T05:04:00.057501+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  2. [2]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  3. [3]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

  4. [4]

    MIT press, 2000

    Rosalind W Picard.Affective computing. MIT press, 2000

  5. [5]

    Mer 2025: When affective computing meets large language models

    Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, et al. Mer 2025: When affective computing meets large language models. InACM MM, 2025

  6. [6]

    Multimodal large language models meet multimodal emotion recognition and reasoning: A survey.arXiv preprint arXiv:2509.24322,

    Yuntao Shou, Tao Meng, Wei Ai, and Keqin Li. Multimodal large language models meet multimodal emotion recognition and reasoning: A survey.arXiv preprint arXiv:2509.24322, 2025

  7. [7]

    Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

    Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mngyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. InACM MM, 2023

  8. [8]

    Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition

    Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, Jinming Zhao, Ziyang Ma, Xie Chen, et al. Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. InProceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

  9. [9]

    Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InACL, 2019

  10. [10]

    Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

  11. [11]

    Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models

    Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, et al. Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. InICML, 2025

  12. [12]

    Ov-mer: Towards open-vocabulary multimodal emotion recognition

    Zheng Lian, Haiyang Sun, Licai Sun, Haoyu Chen, Lan Chen, Hao Gu, Zhuofan Wen, Shun Chen, Siyuan Zhang, Hailiang Yao, et al. Ov-mer: Towards open-vocabulary multimodal emotion recognition. InICML, 2025

  13. [13]

    Explainable multimodal emotion reasoning.CoRR, 2023

    Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. Explainable multimodal emotion reasoning.CoRR, 2023

  14. [14]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  15. [15]

    Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

    Zheng Lian, Fan Zhang, Yazhou Zhang, Jianhua Tao, Rui Liu, Haoyu Chen, and Xiaobai Li. Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

  16. [16]

    Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025

    Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025. 10

  17. [17]

    From system 1 to system 2: A survey of reasoning large language models.TPAMI, 2026

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.TPAMI, 2026

  18. [18]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  19. [19]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  20. [20]

    Visual instruction tuning.NeurIPS, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 2024

  21. [21]

    Parallel diffusion solver via residual dirichlet policy optimization.IEEE TPAMI, pages 1–17, 2026

    Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, and Chi Zhang. Parallel diffusion solver via residual dirichlet policy optimization.IEEE TPAMI, pages 1–17, 2026

  22. [22]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

  23. [23]

    Perception-aware policy optimization for multimodal reasoning

    Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. InICLR, 2026

  24. [24]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InICCV, 2025

  25. [25]

    Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InNeurIPS, 2025

  26. [26]

    Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

    Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

  27. [27]

    Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning.NeurIPS, 2024

    Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning.NeurIPS, 2024

  28. [28]

    Benchmarking and bridging emotion conflicts for multimodal emotion reasoning

    Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, and Xun Yang. Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. InACM MM, 2025

  29. [29]

    Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models

    Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, et al. Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models. InICLR, 2026

  30. [30]

    R1-omni: Explainable omni-multimodal emotion recognition with reinforcing learning.arXiv preprint arXiv:2503.05379, 2025

    Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcing learning.arXiv preprint arXiv:2503.05379, 2025

  31. [31]

    Emotion-coherent reasoning for multimodal llms via emotional rationale verifier

    Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, and Yong Man Ro. Emotion-coherent reasoning for multimodal llms via emotional rationale verifier. InAAAI, 2026

  32. [32]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023. 11

  33. [33]

    Salmonn: Towards generic hearing abilities for large language models

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. In ICLR, 2024

  34. [34]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024

  35. [35]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024

  36. [36]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024

  37. [37]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023

  38. [38]

    Pandagpt: One model to instruction-follow them all

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. InTLLM, 2023

  39. [39]

    Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos

    Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. InCoRR, 2016

  40. [40]

    Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph

    AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. InACL, 2018

  41. [41]

    Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality

    Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. InACL, 2020

  42. [42]

    Make acoustic and visual cues matter: Ch-sims v2

    Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, and Kai Gao. Make acoustic and visual cues matter: Ch-sims v2. 0 dataset and av-mixup consistent module. InICMI, 2022

  43. [43]

    Unsupervised visual chain-of-thought reasoning via preference optimization

    Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via preference optimization. InICCV, 2025

  44. [44]

    I’ll tell you, it’s not easy for a woman who has divorced and is raising a child to find a partner

    Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang, and Xiangnan He. Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation. InICLR, 2026. 12 Appendix A Theoretical analysis of dual-objective disentanglement We provide a theoretical analysis of the dual-objective dise...