AffectVerse: Emotional World Models for Multimodal Affective Computing

Bo Zhao; Fanghua Ye; Sicheng Zhao; Xiaojiang Peng; Yixin Ji; Zitong Yu

arxiv: 2605.19950 · v1 · pith:POBKBMGZnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 06:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal affective computing · emotion recognition · temporal imagination · belief aggregation · world models · cross-modal prediction · MLLM

The pith

AffectVerse adds an emotion world module that predicts short-term affective changes from past multimodal cues to improve recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that multimodal large language models perform better at emotion recognition when they explicitly model how affective states are expected to unfold over short horizons rather than treating inputs as static. Existing models fuse complete audiovisual and text data at once, leaving the dynamics of emotional change implicit, whereas humans integrate observed cues with forward expectations. AffectVerse equips a base model with an Emotion World Module that generates imagined future representations, compresses them into belief tokens, and injects those tokens to guide reasoning. This uses future prediction only as a training signal to make the current belief state carry transition information, without needing future data when the model runs. If the approach holds, it supplies a concrete mechanism for making affective computing more sensitive to change and yields measurable gains on standard benchmarks.
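
The training-only role of future prediction can be illustrated with a toy sketch. It assumes a mean-squared-error objective and a hypothetical one-step predictor; the summary does not state the paper's actual loss or predictor, so `split_at`, `rollout`, `prediction_loss`, and `linear_extrapolation` are illustrative names only.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def split_at(tokens: List[Vector], t_p: int) -> Tuple[List[Vector], List[Vector]]:
    """Split a latent token sequence into observed past and held-out future."""
    return tokens[:t_p], tokens[t_p:]


def rollout(past: List[Vector], step: Callable[[List[Vector]], Vector], k: int) -> List[Vector]:
    """Imagine k future tokens autoregressively, conditioning on past tokens only."""
    context = list(past)
    imagined = []
    for _ in range(k):
        nxt = step(context)   # predict one token from everything seen so far
        imagined.append(nxt)
        context.append(nxt)   # feed the prediction back in for the next step
    return imagined


def prediction_loss(imagined: List[Vector], future: List[Vector]) -> float:
    """Mean squared error between imagined and actual future tokens (assumed loss)."""
    total, count = 0.0, 0
    for pred, target in zip(imagined, future):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def linear_extrapolation(context: List[Vector]) -> Vector:
    """Hypothetical stand-in predictor: continue the last observed trend."""
    last, prev = context[-1], context[-2]
    return [2 * a - b for a, b in zip(last, prev)]


# Toy latent sequence that drifts linearly, so extrapolation is exact here.
sequence = [[float(t), 2.0 * t] for t in range(6)]
past, future = split_at(sequence, t_p=4)

# Training: the future half serves only as the target of the loss.
imagined = rollout(past, linear_extrapolation, k=len(future))
loss = prediction_loss(imagined, future)

# Inference: the same rollout runs on past tokens alone; no future is needed.
inference_imagined = rollout(past, linear_extrapolation, k=2)
```

The point of the sketch is the asymmetry: `future` appears only inside `prediction_loss`, so nothing at inference time depends on unseen signals.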

Core claim

AffectVerse is a Qwen2.5-Omni-based model equipped with an Emotion World Module that contains cross-modal temporal imagination for predicting future video and audio representations from past tokens, modality-aware multi-step attention to aggregate those predictions into belief tokens, and belief injection to insert the tokens into the LLM. The module treats future prediction as a past-conditioned self-supervised signal that forces the current belief state to encode transition cues predictive of subsequent affective change, without replacing observed-history modeling or requiring unseen signals at inference time.
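
The aggregation and injection steps can be sketched with generic cross-attention pooling, in which a few query vectors attend over the imagined video and audio tokens and the pooled results are placed into the LLM input sequence. This is a simplified stand-in, not the paper's MAMA module: it omits modality-type handling and the boundary tokens, and the position of the belief tokens in the sequence is an assumption. All function names and toy values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(query: Vector, memory: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention of one query over a memory."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, mem)) / scale for mem in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * mem[d] for w, mem in zip(weights, memory)) for d in range(dim)]


def aggregate_beliefs(queries: List[Vector], video: List[Vector], audio: List[Vector]) -> List[Vector]:
    """Compress imagined video and audio tokens into one belief token per query."""
    memory = video + audio  # joint memory over both modalities
    return [attend(q, memory) for q in queries]


def inject(video: List[Vector], audio: List[Vector], beliefs: List[Vector], text: List[Vector]) -> List[Vector]:
    """Place belief tokens between audiovisual tokens and the text prompt (assumed layout)."""
    return video + audio + beliefs + text


# Toy 2-d tokens; a real model would use the LLM's hidden size.
queries = [[1.0, 0.0], [0.0, 1.0]]
imagined_video = [[1.0, 0.0], [0.5, 0.0]]
imagined_audio = [[0.0, 1.0]]
beliefs = aggregate_beliefs(queries, imagined_video, imagined_audio)

observed_video = [[0.2, 0.1]]
observed_audio = [[0.1, 0.3]]
text_prompt = [[0.0, 0.0]]
llm_inputs = inject(observed_video, observed_audio, beliefs, text_prompt)
```

In this toy setup the first query leans toward the video tokens and the second toward the audio token, which loosely mirrors the idea of belief tokens that specialize by modality.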

What carries the argument

Emotion World Module, an action-free representation-level component that performs cross-modal temporal imagination followed by belief aggregation to encode transition cues in the current belief state for affective reasoning.

If this is right

The model records at least 2.57 percent higher accuracy than prior models across nine benchmarks.
Each added component—temporal imagination, cross-modal rollout, and belief aggregation—contributes measurable gains in controlled tests.
Predictive belief-state modeling functions as a practical alternative to purely static fusion for affective computing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same past-conditioned prediction structure might transfer to other sequential multimodal tasks where state changes matter, such as action anticipation.
Extending the horizon of the imagination step could test whether longer-range affective forecasts further improve reasoning on extended video clips.
The approach offers a route to make existing MLLMs more robust to missing or noisy frames by baking transition regularities into the belief tokens.

Load-bearing premise

Forcing the current belief state to encode transition cues via past-conditioned future prediction will produce more accurate affective reasoning in the LLM.

What would settle it

An ablation experiment on any of the nine benchmarks that shows zero or negative performance change when the temporal imagination and belief aggregation steps are removed.

Figures

Figures reproduced from arXiv: 2605.19950 by Bo Zhao, Fanghua Ye, Sicheng Zhao, Xiaojiang Peng, Yixin Ji, Zitong Yu.

**Figure 1.** Motivation and positioning of AffectVerse. AffectVerse introduces an Emotion World Module that inserts an intermediate Imagine stage, enabling the model to predict latent affective dynamics before updating the LLM’s emotional context.

**Figure 2.** Overall framework of AffectVerse. AffectVerse extracts audiovisual hidden states from Qwen2.5-Omni, temporally splits them at Tp, imagines future latent tokens with cross-modal multi-step rollout, aggregates imagined tokens through MAMA with boundary tokens vb/ab, and interleaves the resulting belief tokens into the LLM sequence for affective generation.

**Figure 3.** Visualization of MAMA belief aggregation. MAMA belief tokens attend nonuniformly to visual and acoustic memories: some integrate both modalities, while others specialize toward visual (e.g., B7) or acoustic evidence (e.g., B5–B6). This suggests learned modality-aware specialization from type-aware aggregation, rather than manually assigned token roles.

**Figure 4.** Rollout depth trade-off.

**Figure 5.** Effect of belief token count. The accompanying text reports −0.68% on average at Nb=16 and an optimum at Nb=4, corresponding to 8 tokens under dual-modality. On cross-modal versus self-modal imagination, it describes the cross-modal design as a key differentiator of AffectVerse: rather than predicting each modality’s future in isolation, each modality attends to both its own and the other modality’s past. Table 5 of the paper isolates this contribution.

**Figure 6.** Effect of modality dropout. Modality dropout (p=0.15) trains the EWM to form beliefs from partial sensory evidence.

**Figure 7.** Case study. AffectGPT keeps predicting a negative emotion cluster across all observation ratios, while AffectVerse moves from stress under limited observation to the correct neutral prediction when the full audiovisual context is available. In this case, the text semantics and early facial cues suggest a potentially negative situation, which is why AffectGPT repeatedly outputs negative labels.

**Figure 8.** Supplementary visualization of the full-observation correction case. This appendix figure provides an alternative compact visualization of the qualitative example.
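
The modality dropout mentioned in the Figure 6 caption (p=0.15) can be sketched as follows. The captions do not say how the dropped modality is chosen, so this sketch assumes that one modality is hidden, picked uniformly at random, and that both are never hidden together; `modality_dropout` is a hypothetical helper.

```python
import random
from typing import List, Optional, Tuple

Tokens = List[List[float]]


def modality_dropout(
    video: Tokens, audio: Tokens, p: float, rng: random.Random
) -> Tuple[Optional[Tokens], Optional[Tokens]]:
    """With probability p, hide one modality for this training sample.

    Assumption: exactly one modality is dropped, chosen uniformly.
    """
    if rng.random() < p:
        if rng.random() < 0.5:
            return None, audio   # video hidden
        return video, None       # audio hidden
    return video, audio          # both modalities kept


# Empirical check on toy tokens that the drop rate is close to p.
rng = random.Random(0)
video = [[0.1, 0.2]]
audio = [[0.3, 0.4]]
trials = 10_000
dropped = 0
for _ in range(trials):
    v, a = modality_dropout(video, audio, p=0.15, rng=rng)
    if v is None or a is None:
        dropped += 1
drop_rate = dropped / trials
```

Dropout of this kind applies during training only; at evaluation time both modalities would be passed through unchanged.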

read the original abstract

Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AffectVerse adds a predictive module to Qwen2.5-Omni for short-horizon affective forecasting and reports benchmark gains, but the evidence that the belief tokens specifically capture transition dynamics rather than extra capacity remains thin.

The main thing to know is that this paper layers an Emotion World Module onto Qwen2.5-Omni so the model can imagine short-term future multimodal representations and use that to inform current affective reasoning. They treat future prediction as a past-only self-supervised signal during training, then drop the future stuff at inference and just inject the resulting belief tokens into the LLM.

Referee Report

2 major / 1 minor

Summary. The paper proposes AffectVerse, a Qwen2.5-Omni-based MLLM augmented with an Emotion World Module (EWM) for short-horizon latent affective prediction. EWM comprises Cross-Modal Temporal Imagination (multi-step rollout of future video/audio representations from past tokens), MAMA (Modality-Aware Multi-step Attention) Belief Aggregation (compressing imagined tokens into modality-aware belief tokens), and Belief Injection (inserting these tokens into the LLM). Future prediction serves as a past-conditioned self-supervised signal that does not replace observed history modeling or require unseen inputs at inference, but is intended to force the belief state to encode transition cues predictive of affective change. The manuscript reports at least 2.57% improvement across nine benchmarks, with controlled ablations indicating additive gains from temporal imagination, cross-modal rollout, and belief aggregation.

Significance. If the results and mechanism hold, this provides a practical demonstration that incorporating predictive belief-state modeling can improve multimodal affective reasoning in LLMs by making affective dynamics more explicit. The controlled ablations isolating contributions from each EWM component and the multi-benchmark evaluation are strengths that support claims of additive utility over static fusion approaches.

major comments (2)

[Abstract] The central claim that future prediction forces the current belief state (via MAMA aggregation and injection) to encode transition cues predictive of affective change lacks any reported probe, visualization, auxiliary metric, or correlation analysis showing that the injected belief tokens specifically improve future-state prediction or align with emotion dynamics beyond generic capacity or cross-modal attention gains. This verification is load-bearing for distinguishing the intended world-model mechanism from architectural additions.
[Experimental results] The reported minimum 2.57% improvement and ablation gains are presented without details on exact baselines, dataset splits, statistical significance tests, or potential confounds (e.g., parameter count differences). These omissions limit verification of whether the gains are robust and attributable to the proposed components rather than implementation variations.
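
A probe of the kind the first major comment asks for could look like the sketch below. It assumes frozen belief-token features and binary labels marking whether the emotion changes in the following window, and it fits a plain logistic-regression probe by gradient descent. The feature values, labels, and function names are hypothetical and do not come from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(
    features: List[Vector], labels: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[Vector, float]:
    """Fit logistic regression on frozen features with batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w: Vector, b: float, features: List[Vector], labels: List[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


# Placeholder "belief features"; label 1 means the emotion changes next.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
train_y = [1, 1, 1, 0, 0, 0]
heldout_x = [[0.85, 0.15], [0.15, 0.85]]
heldout_y = [1, 0]

w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, heldout_x, heldout_y)
```

To address the capacity concern, the same probe would be run on belief tokens from a model trained without the future-prediction objective and the two held-out accuracies compared.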

minor comments (1)

[Abstract] The parenthetical expansion of MAMA as (Modality-Aware Multi-step Attention) in the abstract could be clarified for consistency with standard acronym usage if it is intended as a defined module name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and constructive feedback on our manuscript. We have addressed each of the major comments in detail below. Where appropriate, we have revised the manuscript to incorporate additional analyses and clarifications.

Referee: [Abstract] The central claim that future prediction forces the current belief state (via MAMA aggregation and injection) to encode transition cues predictive of affective change lacks any reported probe, visualization, auxiliary metric, or correlation analysis showing that the injected belief tokens specifically improve future-state prediction or align with emotion dynamics beyond generic capacity or cross-modal attention gains. This verification is load-bearing for distinguishing the intended world-model mechanism from architectural additions.

Authors: We thank the referee for highlighting this important aspect. The ablations in the original manuscript already isolate the contributions of the Cross-Modal Temporal Imagination and MAMA Belief Aggregation, showing gains beyond the base model's cross-modal capabilities. To further address the request for direct verification, we have included in the revised manuscript a new analysis that examines the predictive power of the injected belief tokens for future affective states. Specifically, we report the accuracy of a linear classifier trained on belief tokens to predict emotion transitions, demonstrating improved alignment with affective dynamics when the future prediction objective is included. (Revision: yes)

Referee: [Experimental results] The reported minimum 2.57% improvement and ablation gains are presented without details on exact baselines, dataset splits, statistical significance tests, or potential confounds (e.g., parameter count differences). These omissions limit verification of whether the gains are robust and attributable to the proposed components rather than implementation variations.

Authors: The referee correctly notes the need for more detailed experimental reporting. We have revised the manuscript to include: exact specifications of the baseline models and their parameter counts for comparison; descriptions of the train/validation/test splits used for each benchmark; and results of statistical significance testing (paired t-tests with p-values) across multiple runs. Additionally, we discuss that the added parameters from the EWM are minimal and do not account for the observed improvements, as confirmed by the ablation studies. (Revision: yes)
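
The paired t-test named in the second response can be sketched with the standard library alone. The accuracy values below are placeholders chosen for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper; converting the statistic to a p-value needs a t-distribution CDF, which the sketch leaves out.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(baseline: Sequence[float], proposed: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic and degrees of freedom for matched runs."""
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [p - b for b, p in zip(baseline, proposed)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the per-run differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder accuracies (%) for five matched seeds; NOT the paper's numbers.
baseline_runs = [70.0, 71.0, 69.0, 70.0, 72.0]
proposed_runs = [72.0, 74.0, 70.0, 73.0, 74.0]
t_stat, dof = paired_t_statistic(baseline_runs, proposed_runs)
```

Pairing matters here because each seed or split is shared between the two systems, so the test is run on per-run differences rather than on two independent samples. For 4 degrees of freedom the two-sided 5% critical value is about 2.776.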

Circularity Check

0 steps flagged

No circularity: self-supervised future prediction is an independent training signal, not a definitional reduction

The paper's core mechanism uses past-conditioned future prediction as an auxiliary self-supervised objective to shape belief tokens via MAMA aggregation and injection. This is presented as a training design that encourages encoding of transition cues without replacing observed history modeling or requiring unseen inputs at inference. Reported benchmark gains and ablations are external empirical outcomes, not quantities defined by the fitted parameters themselves. No equations or claims reduce the performance assertions to tautological redefinitions, fitted-input renamings, or self-citation chains. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on the abstract alone, the central claim rests on standard self-supervised learning assumptions and the effectiveness of the newly introduced modules; no explicit numerical free parameters are stated, and the new module is the primary addition beyond prior MLLM work.

axioms (1)

Domain assumption: Future multimodal representations can be predicted from past tokens to create useful belief states for current affective reasoning.
This premise enables the self-supervised training signal described in the abstract.

invented entities (1)

Emotion World Module (EWM) · no independent evidence
Purpose: to perform short-horizon latent affective prediction through temporal imagination and belief aggregation.
New architectural component introduced by the paper; no independent evidence outside the reported experiments is mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: future prediction as a past-conditioned self-supervised signal ... forces the current belief state to encode transition cues

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 12 internal anchors

[1]

The theory of constructed emotion: an active inference account of interoception and categorization.Social Cognitive and Affective Neuroscience, 12(1):1–23, 2017

Lisa Feldman Barrett. The theory of constructed emotion: an active inference account of interoception and categorization.Social Cognitive and Affective Neuroscience, 12(1):1–23, 2017

work page 2017
[2]

IEMOCAP: Interactive emo- tional dyadic motion capture database.Language Resources and Evaluation, 42(4):335–359, 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP: Interactive emo- tional dyadic motion capture database.Language Resources and Evaluation, 42(4):335–359, 2008

work page 2008
[3]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

work page 2021
[4]

Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning.arXiv preprint arXiv:2406.11161, 2024

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning.arXiv preprint arXiv:2406.11161, 2024

work page arXiv 2024
[5]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Whatever next? Predictive brains, situated agents, and the future of cognitive science.Behavioral and Brain Sciences, 36(3):181–204, 2013

Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science.Behavioral and Brain Sciences, 36(3):181–204, 2013

work page 2013
[7]

Humans integrate visual and haptic information in a statistically optimal fashion.Nature, 415(6870):429–433, 2002

Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion.Nature, 415(6870):429–433, 2002

work page 2002
[8]

The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010

Karl Friston. The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010

work page 2010
[9]

Bootstrap your own latent: A new approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pinto, Zhaohan Zheng, Mohammad Gheshlaghi Azizi, Mateusz Malinowski, Yee Whye Teh, Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Bootstrap your own latent: A new approach to self-supervised learning. In Advanc...

work page 2020
[10]

World models

David Ha and Jürgen Schmidhuber. World models. InAdvances in Neural Information Pro- cessing Systems, volume 31, 2018

work page 2018
[11]

Mastering Atari with discrete world models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. InInternational Conference on Learning Representations, 2021

work page 2021
[12]

Mastering Diverse Domains through World Models

DanijarHafner, JurgisPasukonis, JimmyBa, andTimothyLillicrap. Masteringdiversedomains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

OneLLM: One framework to align all modalities with language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. OneLLM: One framework to align all modalities with language. arXiv preprint arXiv:2312.03700, 2023

work page arXiv 2023
[14]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInter- national Conference on Learning Representations, 2022

work page 2022
[15]

Complementarity-supervised spectral-band routing for multimodal emotion recognition.arXiv preprint arXiv:2603.13340, 2026

Zhexian Huang, Bo Zhao, Hui Ma, Zhishu Liu, Jie Zhang, Ruixin Zhang, Shouhong Ding, and Zitong Yu. Complementarity-supervised spectral-band routing for multimodal emotion recognition.arXiv preprint arXiv:2603.13340, 2026. doi: 10.48550/arXiv.2603.13340

work page doi:10.48550/arxiv.2603.13340 2026
[16]

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023

work page arXiv 2023
[17]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

VideoChat: Chat-Centric Video Understanding

KunchangLi, YinanHe, YiWang, YizhuoLi, WenhaiWang, PingLuo, YaliWang, LiminWang, and Yu Qiao. VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark.arXiv preprint arXiv:2311.17005, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

LLaMA-VID: An image is worth 2 tokens in large language models.arXiv preprint arXiv:2311.17043, 2023

Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models.arXiv preprint arXiv:2311.17043, 2023

work page arXiv 2023
[21]

AffectGPT: Multimodal large language model for emotion recognition.arXiv preprint arXiv:2306.15401, 2023

Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. AffectGPT: Multimodal large language model for emotion recognition.arXiv preprint arXiv:2306.15401, 2023

work page arXiv 2023
[22]

OV-MER: Towards open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2410.01495, 2024

Zheng Lian, Haiyang Sun, Licai Sun, Haoyu Chen, Lan Chen, Hao Gu, Zhuofan Wen, Shun Chen, Siyuan Zhang, Hailiang Yao, Bin Liu, Rui Liu, Shan Liang, Ya Li, Jiangyan Yi, and Jianhua Tao. OV-MER: Towards open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2410.01495, 2024

work page arXiv 2024
[23]

MER 2024: Semi-supervised learning, noise robustness, and open- vocabulary multimodal emotion recognition

Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. MER 2024: Semi-supervised learning, noise robustness, and open- vocabulary multimodal emotion recognition. InProceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024. 13

work page 2024
[24]

Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025

ZhengLian, FanZhang, YazhouZhang, JianhuaTao, RuiLiu, HaoyuChen, XiaobaiLi, andBin He. Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition.arXiv preprint arXiv:2508.01318, 2025. doi: 10.48550/arXiv.2508.01318

work page doi:10.48550/arxiv.2508.01318 2025
[25]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video- LLaVA: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

CH-SIMS v2.0: A fine-grained dataset for multimodal sen- timent analysis in chinese

Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, and Kai Gao. CH-SIMS v2.0: A fine-grained dataset for multimodal sen- timent analysis in chinese. InProceedings of the 2022 International Conference on Multimodal Interaction, pages 678–689, 2022

work page 2022
[27]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video- ChatGPT: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Emotion-llamav2 and mmeverse: A new framework and benchmark for multimodal emotion understanding.arXiv preprint arXiv:2601.16449, 2026

Xiaojiang Peng, Jingyi Chen, Zebang Cheng, Bao Peng, Fengyi Wu, Yifei Dong, Shuyuan Tu, Qiyu Hu, Huiting Huang, Yuxiang Lin, Jun-Yan He, Kai Wang, Zheng Lian, and Zhi-Qi Cheng. Emotion-llamav2 and mmeverse: A new framework and benchmark for multimodal emotion understanding.arXiv preprint arXiv:2601.16449, 2026. doi: 10.48550/arXiv.2601.16449

work page doi:10.48550/arxiv.2601.16449 2026
[29]

MELD: A multimodal multi-party dataset for emotion recognition in con- versations

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in con- versations. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, 2019

work page 2019
[30]

Qwen2.5-Omni Technical Report

Qwen Team. Qwen2.5-Omni technical report.arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Klaus R Scherer. Emotions are emergent processes: they require a dynamic computational architecture.Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535): 3459–3474, 2009

work page 2009
[32]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

The information bottleneck method

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000

work page internal anchor Pith review Pith/arXiv arXiv 2000
[35]

SECap: Speech emotion captioning with large language model

Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shixiong Zhang, Guangzhi Li, and Yi Luo. SECap: Speech emotion captioning with large language model. arXiv preprint arXiv:2312.10381, 2023

work page arXiv 2023
[36]

EmoVIT: Multimodal emotion understanding with vision instruction tuning.arXiv preprint arXiv:2404.16670, 2024

Hongxia Yang, Siyang Zhao, and Sheng Li. EmoVIT: Multimodal emotion understanding with vision instruction tuning.arXiv preprint arXiv:2404.16670, 2024

work page arXiv 2024
[37]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, 14 Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

CH-SIMS: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality

Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. CH-SIMS: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3718–3727, 2020

work page 2020
[39]

Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. volume 31, pages 82–88. IEEE, 2016

work page 2016
[40]

You are an emotion recognition assistant,

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Mul- timodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. InProceedings of the 56th Annual Meeting of the Association for Computational Lin- guistics, pages 2236–2246, 2018. A Detailed Dataset Descriptions We evaluate AffectVers...


[1]

Lisa Feldman Barrett. The theory of constructed emotion: an active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience, 12(1):1–23, 2017

[2]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, 2008

[3]

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

[4]

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning. arXiv preprint arXiv:2406.11161, 2024

[5]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023

[6]

Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204, 2013

[7]

Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–433, 2002

[8]

Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010

[9]

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pinto, Zhaohan Zheng, Mohammad Gheshlaghi Azizi, Mateusz Malinowski, Yee Whye Teh, Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Bootstrap your own latent: A new approach to self-supervised learning. In Advanc...

[10]

David Ha and Jürgen Schmidhuber. World models. In Advances in Neural Information Processing Systems, volume 31, 2018

[11]

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations, 2021

[12]

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

[13]

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. OneLLM: One framework to align all modalities with language. arXiv preprint arXiv:2312.03700, 2023

[14]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

[15]

Zhexian Huang, Bo Zhao, Hui Ma, Zhishu Liu, Jie Zhang, Ruixin Zhang, Shouhong Ding, and Zitong Yu. Complementarity-supervised spectral-band routing for multimodal emotion recognition. arXiv preprint arXiv:2603.13340, 2026. doi: 10.48550/arXiv.2603.13340

[16]

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023

[17]

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023

[18]

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

[19]

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023

[20]

Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023

[21]

Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. AffectGPT: Multimodal large language model for emotion recognition. arXiv preprint arXiv:2306.15401, 2023

[22]

Zheng Lian, Haiyang Sun, Licai Sun, Haoyu Chen, Lan Chen, Hao Gu, Zhuofan Wen, Shun Chen, Siyuan Zhang, Hailiang Yao, Bin Liu, Rui Liu, Shan Liang, Ya Li, Jiangyan Yi, and Jianhua Tao. OV-MER: Towards open-vocabulary multimodal emotion recognition. arXiv preprint arXiv:2410.01495, 2024

[23]

Zheng Lian, Licai Sun, Mingyu Xu, Haiyang Sun, Ke Xu, Zhuofan Wen, Shun Chen, Bin Liu, and Jianhua Tao. MER 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

[24]

Zheng Lian, Fan Zhang, Yazhou Zhang, Jianhua Tao, Rui Liu, Haoyu Chen, Xiaobai Li, and Bin He. Affectgpt-r1: Leveraging reinforcement learning for open-vocabulary multimodal emotion recognition. arXiv preprint arXiv:2508.01318, 2025. doi: 10.48550/arXiv.2508.01318

[25]

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023

[26]

Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, and Kai Gao. CH-SIMS v2.0: A fine-grained dataset for multimodal sentiment analysis in chinese. In Proceedings of the 2022 International Conference on Multimodal Interaction, pages 678–689, 2022

[27]

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

[28]

Xiaojiang Peng, Jingyi Chen, Zebang Cheng, Bao Peng, Fengyi Wu, Yifei Dong, Shuyuan Tu, Qiyu Hu, Huiting Huang, Yuxiang Lin, Jun-Yan He, Kai Wang, Zheng Lian, and Zhi-Qi Cheng. Emotion-llamav2 and mmeverse: A new framework and benchmark for multimodal emotion understanding. arXiv preprint arXiv:2601.16449, 2026. doi: 10.48550/arXiv.2601.16449

[29]

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, 2019

[30]

Qwen Team. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025

[31]

Klaus R Scherer. Emotions are emergent processes: they require a dynamic computational architecture. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535):3459–3474, 2009

[32]

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023

[33]

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289, 2023

[34]

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000
