Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

Beier Zhu; Jiangnan Chen; Lewei Lu; Peipei Song; Pengyang Shao; Wenwen Tong; Xinyi Wang; Xun Yang; Zhiyuan Han

arxiv: 2606.25325 · v1 · pith:VPJ4JGWNnew · submitted 2026-06-24 · 💻 cs.AI

Zhiyuan Han, Beier Zhu, Wenwen Tong, Pengyang Shao, Peipei Song, Xinyi Wang, Jiangnan Chen, Lewei Lu, Xun Yang

Pith reviewed 2026-06-25 21:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal emotion reasoning · reinforcement learning · policy optimization · perception reward · cross-modal hallucination · utilization faithfulness · omni-modal models · MEP-Bench

The pith

OPPO applies reinforcement learning to optimize multimodal perception and reduce hallucinations in emotion reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current omni-modal models for emotion reasoning underuse cues across modalities and often hallucinate by inventing details from one modality in another. The paper proposes OPPO, a reinforcement learning approach that first rewards trajectories for recovering specific visual, acoustic, and emotion cues from ground-truth reasoning. It then adds a loss that applies KL penalties only to modality-specific tokens when inputs are masked, to block cross-modal fabrication. Experiments show this yields state-of-the-art results on MER-UniBench and MME-Emotion while raising utilization and faithfulness scores on the new MEP-Bench diagnostic.

Core claim

The central claim is that an Omni-Perception Reward that decomposes ground-truth into fine-grained modality cues, combined with an Omni-Perception Loss that penalizes only modality-specific evidence tokens under masked inputs, produces models that both utilize multimodal information more fully and generate fewer unfaithful statements during emotion reasoning.

What carries the argument

Omni-Perception Reward (decomposes ground-truth into visual, acoustic, and emotion cues) paired with modality-specific KL penalty in the loss function (suppresses cross-modal hallucination on masked inputs).
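
The masked-input penalty described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `evidence_token_kl`, the KL direction (full-input policy relative to masked-input policy), the mean reduction, and the assumption that evidence tokens arrive as a precomputed boolean mask are all hypothetical. The abstract does not say how the term is signed or weighted inside the training objective, so the sketch only computes the quantity itself.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def evidence_token_kl(full_logits, masked_logits, evidence_mask):
    """Mean per-token KL(pi_full || pi_masked), restricted to evidence tokens.

    full_logits, masked_logits: (T, V) next-token logits for the same sampled
        response, scored once with the full input and once with one modality
        masked out.
    evidence_mask: length-T booleans, True where a token belongs to a clause
        about the masked modality (hypothetical labelling, assumed given).
    """
    logp_full = log_softmax(np.asarray(full_logits, dtype=float))
    logp_masked = log_softmax(np.asarray(masked_logits, dtype=float))
    # Per-position KL divergence between the two next-token distributions.
    per_token_kl = (np.exp(logp_full) * (logp_full - logp_masked)).sum(axis=-1)
    mask = np.asarray(evidence_mask, dtype=bool)
    if not mask.any():
        return 0.0  # no modality-specific evidence tokens in this response
    return float(per_token_kl[mask].mean())
```

Positions outside the mask contribute nothing, which is the selective behaviour the summary attributes to the loss: only statements tied to the masked modality are compared across the two input conditions.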

If this is right

Models reach state-of-the-art accuracy on MER-UniBench and MME-Emotion benchmarks.
Utilization of multimodal cues and faithfulness of generated statements both rise measurably on MEP-Bench.
Explicit perception optimization becomes a practical lever for reliable multimodal emotion reasoning.
The introduced MEP-Bench serves as a diagnostic tool to track perception sufficiency and faithfulness separately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-plus-masked-penalty structure could be tested on non-emotion multimodal reasoning tasks such as visual question answering or video captioning.
If the approach scales, perception optimization might become a standard stage before fine-tuning on downstream multimodal tasks.
MEP-Bench style diagnostics could be applied to existing open models to measure how widespread the underutilization and hallucination problems are.
The framework assumes the ground-truth reasoning already contains cleanly separable modality cues; any ambiguity in that decomposition would limit the method's reliability.

Load-bearing premise

The reward decomposition and the targeted KL penalty correctly isolate and improve perception quality without adding new biases or overfitting to the chosen benchmarks.

What would settle it

A model trained with OPPO that shows no gains in utilization or faithfulness scores on MEP-Bench, or that continues to produce modality-specific hallucinations at the same rate as the baseline.

Figures

Figures reproduced from arXiv: 2606.25325 by Beier Zhu, Jiangnan Chen, Lewei Lu, Peipei Song, Pengyang Shao, Wenwen Tong, Xinyi Wang, Xun Yang, Zhiyuan Han.

**Figure 1.** Existing emotion Omni-MLLMs lack reliable perception of visual and acoustic cues. (a) Underutilization (violating the Utilization principle). An empirical case from MEP-Bench where AffectGPT-R1 (Lian et al., 2025c) overlooks fine-grained visual cues (e.g., furrowed brow, sad eyes) and mainly relies on the crying tone in the audio due to “thinking inertia”. (b) Unfaithfulness (violating the Faithfulness pri…

**Figure 2.** The framework of OPPO. Top: Fine-grained evidence is organized into video cues CV, audio cues CA, and emotion cues CE. CoT is segmented into clauses U, which are identified as video, audio, or emotion types based on their semantic content. Middle: The Omni-Perception Reward (3.1). An Evidence-Routing Matrix computes the semantic similarities between generated clauses and ground-truth cues. This encourages…
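
The routing idea in the Figure 2 caption can be illustrated with a small sketch. It assumes clause and cue embeddings are already available as non-zero vectors and that each carries a type label of `"video"`, `"audio"`, or `"emotion"`. The function name `routed_cue_reward` and the aggregation rule (best same-type match per cue, clipped to [0, 1], then averaged) are hypothetical choices made for illustration; the caption is truncated before it states how the paper aggregates the matrix.

```python
import numpy as np


def routed_cue_reward(clause_emb, clause_type, cue_emb, cue_type):
    """Score how well generated clauses recover ground-truth cues.

    clause_emb: (n_clauses, d) embeddings of generated clauses.
    cue_emb:    (n_cues, d) embeddings of ground-truth cues.
    clause_type, cue_type: sequences of 'video' | 'audio' | 'emotion'.
    Returns the mean, over cues, of the best cosine similarity to a clause
    of the same type (hypothetical aggregation; 0 when no such clause exists).
    """
    cues = np.asarray(cue_emb, dtype=float)
    clauses = np.asarray(clause_emb, dtype=float)
    if cues.shape[0] == 0 or clauses.shape[0] == 0:
        return 0.0
    cues = cues / np.linalg.norm(cues, axis=1, keepdims=True)
    clauses = clauses / np.linalg.norm(clauses, axis=1, keepdims=True)
    sim = cues @ clauses.T  # (n_cues, n_clauses) cosine similarities
    # Routing: a cue may only be matched by clauses of its own modality type.
    same_type = np.asarray(cue_type)[:, None] == np.asarray(clause_type)[None, :]
    routed = np.where(same_type, sim, -np.inf)
    best = np.clip(routed.max(axis=1), 0.0, 1.0)  # -inf (no match) becomes 0
    return float(best.mean())
```

Masking cross-type entries before taking the maximum means a visual cue cannot be credited to an audio clause, which is one simple way to keep the reward modality-aware.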

**Figure 3.** Synergistic Effect of Omni-Perception Loss on Reward. While minimal weights yield limited gains, increasing β to a sufficient magnitude (≥ 8e−3) significantly elevates the final Omni-Perception Reward.

… called if its maximum cosine similarity to any generated cue exceeds 0.6. (2) Faithfulness is evaluated via a POPE-style unimodal masking probe (Li et al., 2023; Ye et al., 2026). On samples with strong aud…
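
The excerpt above mentions a cosine-similarity threshold of 0.6. Its opening words are cut off, so the sketch below rests on an assumed reading: a ground-truth cue counts as recalled when its best cosine similarity to any generated cue exceeds the threshold, and utilization is the fraction of ground-truth cues recalled. The function name `utilization_recall` and the handling of empty inputs are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import numpy as np


def utilization_recall(gt_cue_emb, gen_cue_emb, threshold=0.6):
    """Fraction of ground-truth cues matched by at least one generated cue.

    gt_cue_emb:  (n_gt, d) embeddings of ground-truth cues.
    gen_cue_emb: (n_gen, d) embeddings of cues extracted from the model output.
    threshold:   cosine-similarity cutoff (0.6 in the excerpt).
    """
    gt = np.asarray(gt_cue_emb, dtype=float)
    gen = np.asarray(gen_cue_emb, dtype=float)
    if gt.shape[0] == 0 or gen.shape[0] == 0:
        return 0.0  # nothing to recall, or nothing generated
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    best = (gt @ gen.T).max(axis=1)  # best match for each ground-truth cue
    return float((best > threshold).mean())
```

A strict inequality is used because the excerpt says "exceeds"; whether the paper treats a similarity of exactly 0.6 as a match is not stated.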

**Figure 4.** Emotion wheels. We adopt five emotion wheels (W1–W5) to compute the EW-based metrics.

B.2. Fine-grained Emotion Recognition

Fine-grained Emotion Recognition aims to predict a set of fine-grained emotion descriptors beyond basic categories. The output space is open-vocabulary, and each sample may correspond to multiple emotion labels. We evaluate this task on OV-MERD+ (Lian et al., 2025a), which extends OV-…

Original abstract

We find that current emotion-oriented Omni-MLLMs still lack reliable omni-modal perception: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose OPPO (Omni-Perception Policy Optimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce MEP-Bench, a diagnostic benchmark that quantifies utilization and faithfulness. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and MME-Emotion, while substantially improving utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OPPO applies RL with cue-decomposed rewards and masked KL penalties to reduce under-utilization and cross-modal hallucinations in emotion reasoning, but the evaluation hinges on an author-introduced benchmark.

The paper identifies that current omni-MLLMs under-use multimodal cues and hallucinate modality-specific claims in emotion reasoning. OPPO counters this with a policy optimization that rewards trajectories recovering fine-grained visual, acoustic, and emotion cues from ground truth, plus a KL term that penalizes only modality-specific tokens when inputs are masked. They also release MEP-Bench to score utilization and faithfulness.

The novelty lies in combining a cue-level reward with a selective, masked-input KL term for the emotion task, which targets the two failure modes directly. The reported gains on MER-UniBench and MME-Emotion, plus the diagnostic scores on MEP-Bench, show that the method moves the metrics the authors care about.

The main soft spot is that MEP-Bench comes from the same authors, so the risk of benchmark-specific tuning is real and needs checking. The abstract also supplies no equations, training details, or ablations, which leaves open whether the cue decomposition introduces its own biases and whether the masked KL cleanly isolates evidence. Without those, the soundness of the central claims stays provisional.

The work is aimed at researchers building or evaluating multimodal models for affective tasks. A reader already working on RL for perception in MLLMs could pick up the reward and penalty design.

It deserves peer review because the problem is concrete and the proposed fix is testable, even though the experiments will require close scrutiny on independence and implementation.

Referee Report

2 major / 0 minor

Summary. The paper identifies that current emotion-oriented Omni-MLLMs underutilize multimodal cues and exhibit cross-modal hallucinations in reasoning. It proposes OPPO, a reinforcement learning framework consisting of an Omni-Perception Reward that decomposes ground-truth trajectories into fine-grained visual, acoustic, and emotion cues, and an Omni-Perception Loss that applies a modality-specific KL penalty on evidence tokens under full versus masked inputs. The authors introduce MEP-Bench to quantify utilization and faithfulness, and report that OPPO achieves SOTA results on MER-UniBench and MME-Emotion while substantially improving scores on MEP-Bench.

Significance. If the performance gains prove robust and independent of benchmark construction, the work would provide a concrete mechanism for improving perception quality in multimodal emotion models, along with a diagnostic benchmark that could see wider adoption. The explicit decomposition of perception into modality-specific cues and the targeted KL penalty represent a focused contribution over generic RLHF approaches.

major comments (2)

[MEP-Bench definition and experimental results] The central claim of 'substantially improving utilization and faithfulness scores on MEP-Bench' is load-bearing, yet MEP-Bench is introduced by the same authors to measure precisely the quantities optimized by the Omni-Perception Reward and Loss. Without an independent construction protocol or external validation set, the reported gains risk circularity; the manuscript must demonstrate that benchmark items and scoring rules were fixed prior to method development and are not post-hoc tuned to OPPO trajectories.
[Omni-Perception Loss formulation] The Omni-Perception Loss applies a KL penalty only to modality-specific evidence tokens under unimodal masking. The exact token identification procedure, the mathematical form of the penalty (including any temperature or weighting hyperparameters), and ablations showing that this selective penalty does not simply suppress all non-visual tokens or introduce new modality biases are required; these details are load-bearing for the claim that the loss 'suppresses cross-modal hallucination' without side effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that directly respond to the concerns raised.

Referee: [MEP-Bench definition and experimental results] The central claim of 'substantially improving utilization and faithfulness scores on MEP-Bench' is load-bearing, yet MEP-Bench is introduced by the same authors to measure precisely the quantities optimized by the Omni-Perception Reward and Loss. Without an independent construction protocol or external validation set, the reported gains risk circularity; the manuscript must demonstrate that benchmark items and scoring rules were fixed prior to method development and are not post-hoc tuned to OPPO trajectories.

Authors: We agree that demonstrating independence between benchmark construction and method development is essential to address potential circularity. In the revised manuscript we will add a dedicated subsection detailing the MEP-Bench construction timeline, item selection criteria, scoring rubric, and inter-annotator validation protocol, explicitly stating that these elements were finalized before OPPO training began. We will also report any external validation steps performed (e.g., correlation with existing emotion benchmarks) to further substantiate that the benchmark was not tuned post-hoc to OPPO trajectories. revision: yes
Referee: [Omni-Perception Loss formulation] The Omni-Perception Loss applies a KL penalty only to modality-specific evidence tokens under unimodal masking. The exact token identification procedure, the mathematical form of the penalty (including any temperature or weighting hyperparameters), and ablations showing that this selective penalty does not simply suppress all non-visual tokens or introduce new modality biases are required; these details are load-bearing for the claim that the loss 'suppresses cross-modal hallucination' without side effects.

Authors: We acknowledge that the current manuscript lacks sufficient implementation detail on the loss. In revision we will expand the methodology section to include: (i) the exact algorithm and criteria used to identify modality-specific evidence tokens, (ii) the complete mathematical expression of the Omni-Perception Loss with all hyperparameters (temperature, weighting coefficients, etc.), and (iii) new ablation experiments that measure the loss's effect on visual versus non-visual tokens and across modalities to rule out unintended suppression or bias. These additions will directly support the claim that the penalty targets cross-modal hallucination without the side effects noted. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and context describe OPPO's reward decomposition and loss, plus introduction of MEP-Bench to measure utilization/faithfulness, with SOTA claims on the independent external benchmarks MER-UniBench and MME-Emotion. No equations, self-citations, or derivation steps are supplied that reduce any claimed result to its own inputs by construction (e.g., no fitted parameter renamed as prediction, no self-definitional loop, no load-bearing self-citation chain). The central performance claims on external benchmarks remain independent of the new benchmark's construction, so the derivation is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on standard RL assumptions plus the unverified claim that the proposed reward and loss isolate perception quality; no free parameters or invented physical entities are described.

axioms (2)

domain assumption: Ground-truth reasoning trajectories can be reliably decomposed into modality-specific cues that serve as an objective reward signal.
Invoked in the definition of the Omni-Perception Reward.
domain assumption: KL divergence applied only to modality-specific tokens under masked inputs suppresses hallucination without harming overall performance.
Central to the Omni-Perception Loss design.

invented entities (1)

MEP-Bench (no independent evidence)
purpose: Diagnostic benchmark quantifying utilization and faithfulness of multimodal cues.
New benchmark introduced to evaluate the method.

pith-pipeline@v0.9.1-grok · 5736 in / 1393 out tokens · 23727 ms · 2026-06-25T21:29:47.032204+00:00 · methodology

Reference graph

Works this paper leans on

211 extracted references · 2 canonical work pages

[1] Real-Time Motion-Controllable Autoregressive Video Diffusion. The Fourteenth International Conference on Learning Representations.
[2] Wang, Ruoyu and Li, Ziyu and Zhu, Beier and Yuan, Liangyu and Zhang, Hanwang and Yang, Xun and Chang, Xiaojun and Zhang, Chi. 2026. doi:10.1109/TPAMI.2026.3692227
[3] Zhao, Kesen and Zhu, Beier and Sun, Qianru and Zhang, Hanwang. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[4] From Social Media to Psychological Scale: An Adaptive Framework with Two-Hop Retrieval for Depression Screening. Proceedings of the ACM Web Conference 2026, 2026.
[5] Psyche-r1: Towards reliable psychological llms through unified empathy, expertise, and reasoning. arXiv preprint arXiv:2508.10848.
[6] Affordbot: 3d fine-grained embodied reasoning via multimodal large language models. Advances in Neural Information Processing Systems.
[7] Augrefer: Advancing 3d visual grounding via cross-modal augmentation and spatial relation-based referring. Proceedings of the AAAI Conference on Artificial Intelligence.
[8] Stimuli-Aware Emotion Adaptor for Enhancing LLM in Affective Explanation Captioning. ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.
[9] Song, Peipei and Guo, Dan and Yang, Xun and Tang, Shengeng and Wang, Meng. Emotional Video Captioning With Vision-Based Emotion Interpretation Network.
[10] Multi-round Mutual Emotion-Cause Pair Extraction for Emotion-Attributed Video Captioning. Proceedings of the 33rd ACM International Conference on Multimedia.
[11] Gpt-4o system card. arXiv preprint arXiv:2410.21276.
[12] Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models. ICLR.
[13] OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267.
[14] Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215.
[15] A mathematical theory of communication. The Bell system technical journal, 1948.
[16] Qwen2.5 Technical Report. ArXiv.
[17] Qwen3-Omni Technical Report. arXiv preprint arXiv:2509.17765.
[18] Affective computing. 2000.
[19] Multimodal large language models meet multimodal emotion recognition and reasoning: A survey. arXiv preprint arXiv:2509.24322.
[20] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 2025.
[21] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context. arXiv preprint arXiv:2506.21277.
[22] AffectGPT-R1: Leveraging Reinforcement Learning for Open-Vocabulary Multimodal Emotion Recognition. arXiv preprint arXiv:2508.01318.
[23] Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. ACM MM.
[24] Evaluating object hallucination in large vision-language models. EMNLP.
[25] The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio. arXiv preprint arXiv:2410.12787.
[26] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? AAAI.
[27] Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. NeurIPS.
[28] Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442.
[29] Mask-dpo: Generalizable fine-grained factuality alignment of llms. ICLR 2025.
[30] Perception-aware policy optimization for multimodal reasoning. ICLR.
[31] Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
[32] Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365.
[33] Visual-RFT: Visual Reinforcement Fine-Tuning. ICCV.
[34] Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
[35] Mer 2025: When affective computing meets large language models. ACM MM, 2025.
[36] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176.
[37] Meld: A multimodal multi-party dataset for emotion recognition in conversations. Proceedings of the 57th annual meeting of the association for computational linguistics.
[38] IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 2008.
[39] Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. ACM MM.
[40] Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. CoRR.
[41] Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. ACL.
[42] Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. ACL.
[43] Make acoustic and visual cues matter: Ch-sims v2.0 dataset and av-mixup consistent module. ICMI.
[44] Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 2014.
[45] Mgeed: A multimodal genuine emotion and expression detection database. IEEE Transactions on Affective Computing, 2023.
[46] Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. ACM MM, 2023.
[47] Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024.
[48] Merbench: A unified evaluation benchmark for multimodal emotion recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[49] Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
[50] Explainable multimodal emotion reasoning. CoRR.
[51] Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. ICML.
[52] Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. NeurIPS.
[53] Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis. arXiv preprint arXiv:2501.09502.
[54] HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding. arXiv preprint arXiv:2501.15111.
[55] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning. arXiv preprint arXiv:2503.05379.
[56] Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning. arXiv preprint arXiv:2511.18437.
[57] Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization. arXiv preprint arXiv:2509.21854.
[58] Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward. arXiv preprint arXiv:2506.07218.
[59] Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence.
[60] Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.
[61] Lora: Low-rank adaptation of large language models. ICLR.
[62] Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379.
[63] Hydralora: An asymmetric lora architecture for efficient fine-tuning. NeurIPS.
[64] Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. ACM MM.
[65] Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. CVPR.
[66] Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770.
[67] Image as a foreign language: Beit pretraining for vision and vision-language tasks. CVPR.
[68] Scaling vision-language models with sparse mixture of experts. Findings of EMNLP.
[69] Omni-smola: Boosting generalist multimodal models with soft mixture of low-rank experts. CVPR.
[70] Survey of hallucination in natural language generation. ACM computing surveys.
[71] A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.
[72] A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253.
[73] Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. CVPR.
[74] Mitigating object hallucinations in large vision-language models through visual contrastive decoding. CVPR.
[75] VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding. arXiv preprint arXiv:2411.15839.
[76] Seeing clearly by layer two: Enhancing attention heads to alleviate hallucination in lvlms. arXiv preprint arXiv:2411.09968.
[77] Decoupled weight decay regularization. ICLR.
[78] Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.
[79] Video-llava: Learning united visual representation by alignment before projection. EMNLP.
[80] Onellm: One framework to align all modalities with language. CVPR.

Showing first 80 references.

[1] [1]

The Fourteenth International Conference on Learning Representations , year=

Real-Time Motion-Controllable Autoregressive Video Diffusion , author=. The Fourteenth International Conference on Learning Representations , year=

[2] [2]

2026 , volume=

Wang, Ruoyu and Li, Ziyu and Zhu, Beier and Yuan, Liangyu and Zhang, Hanwang and Yang, Xun and Chang, Xiaojun and Zhang, Chi , journal=. 2026 , volume=. doi:10.1109/TPAMI.2026.3692227 , url =

work page doi:10.1109/tpami.2026.3692227 2026

[3] [3]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Zhao, Kesen and Zhu, Beier and Sun, Qianru and Zhang, Hanwang , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

2025

[4] [4]

Proceedings of the ACM Web Conference 2026 , pages=

From Social Media to Psychological Scale: An Adaptive Framework with Two-Hop Retrieval for Depression Screening , author=. Proceedings of the ACM Web Conference 2026 , pages=

2026

[5] [5]

arXiv preprint arXiv:2508.10848 , year=

Psyche-r1: Towards reliable psychological llms through unified empathy, expertise, and reasoning , author=. arXiv preprint arXiv:2508.10848 , year=

arXiv

[6] [6]

Advances in Neural Information Processing Systems , volume=

Affordbot: 3d fine-grained embodied reasoning via multimodal large language models , author=. Advances in Neural Information Processing Systems , volume=

[7] [7]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Augrefer: Advancing 3d visual grounding via cross-modal augmentation and spatial relation-based referring , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[8] [8]

ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Stimuli-Aware Emotion Adaptor for Enhancing LLM in Affective Explanation Captioning , author=. ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2026 , organization=

2026

[9] [9]

Emotional Video Captioning With Vision-Based Emotion Interpretation Network , year=

Song, Peipei and Guo, Dan and Yang, Xun and Tang, Shengeng and Wang, Meng , journal=. Emotional Video Captioning With Vision-Based Emotion Interpretation Network , year=

[10] Multi-round Mutual Emotion-Cause Pair Extraction for Emotion-Attributed Video Captioning. Proceedings of the 33rd ACM International Conference on Multimedia.

[11] Gpt-4o system card. arXiv preprint arXiv:2410.21276.

[12] Mme-emotion: A holistic evaluation benchmark for emotional intelligence in multimodal large language models. ICLR.

[13] OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267.

[14] Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215.

[15] A mathematical theory of communication. The Bell system technical journal, 1948.

[16] Qwen2.5 Technical Report. ArXiv.

[17] Qwen3-Omni Technical Report. arXiv preprint arXiv:2509.17765.

[18] Affective computing. 2000.

[19] Multimodal large language models meet multimodal emotion recognition and reasoning: A survey. arXiv preprint arXiv:2509.24322.

[20] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 2025.

[21] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context. arXiv preprint arXiv:2506.21277.

[22] AffectGPT-R1: Leveraging Reinforcement Learning for Open-Vocabulary Multimodal Emotion Recognition. arXiv preprint arXiv:2508.01318.

[23] Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. ACM MM.

[24] Evaluating object hallucination in large vision-language models. EMNLP.

[25] The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio. arXiv preprint arXiv:2410.12787.

[26] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? AAAI.

[27] Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. NeurIPS.

[28] Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442.

[29] Mask-dpo: Generalizable fine-grained factuality alignment of llms. ICLR 2025.

[30] Perception-aware policy optimization for multimodal reasoning. ICLR.

[31] Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[32] Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365.

[33] Visual-RFT: Visual Reinforcement Fine-Tuning. ICCV.

[34] Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

[35] Mer 2025: When affective computing meets large language models. ACM MM, 2025.

[36] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176.

[37] Meld: A multimodal multi-party dataset for emotion recognition in conversations. Proceedings of the 57th annual meeting of the association for computational linguistics.

[38] IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 2008.

[39] Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. ACM MM.

[40] Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. CoRR.

[41] Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. ACL.

[42] Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. ACL.

[43] Make acoustic and visual cues matter: Ch-sims v2.0 dataset and av-mixup consistent module. ICMI.

[44] Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 2014.

[45] Mgeed: A multimodal genuine emotion and expression detection database. IEEE Transactions on Affective Computing, 2023.

[46] Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. ACM MM, 2023.

[47] Mer 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024.

[48] Merbench: A unified evaluation benchmark for multimodal emotion recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49] Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

[50] Explainable multimodal emotion reasoning. CoRR.

[51] Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. ICML.

[52] Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. NeurIPS.

[53] Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis. arXiv preprint arXiv:2501.09502.

[54] HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding. arXiv preprint arXiv:2501.15111.

[55] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning. arXiv preprint arXiv:2503.05379.

[56] Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning. arXiv preprint arXiv:2511.18437.

[57] Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization. arXiv preprint arXiv:2509.21854.

[58] Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward. arXiv preprint arXiv:2506.07218.

[59] Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence.

[60] Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.

[61] Lora: Low-rank adaptation of large language models. ICLR.

[62] Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379.

[63] Hydralora: An asymmetric lora architecture for efficient fine-tuning. NeurIPS.

[64] Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. ACM MM.

[65] Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. CVPR.

[66] Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770.

[67] Image as a foreign language: Beit pretraining for vision and vision-language tasks. CVPR.

[68] Scaling vision-language models with sparse mixture of experts. Findings of EMNLP.

[69] Omni-smola: Boosting generalist multimodal models with soft mixture of low-rank experts. CVPR.

[70] Survey of hallucination in natural language generation. ACM computing surveys.

[71] A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.

[72] A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253.

[73] Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. CVPR.

[74] Mitigating object hallucinations in large vision-language models through visual contrastive decoding. CVPR.

[75] VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding. arXiv preprint arXiv:2411.15839.

[76] Seeing clearly by layer two: Enhancing attention heads to alleviate hallucination in lvlms. arXiv preprint arXiv:2411.09968.

[77] Decoupled weight decay regularization. ICLR.

[78] Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.

[79] Video-llava: Learning united visual representation by alignment before projection. EMNLP.

[80] Onellm: One framework to align all modalities with language. CVPR.