VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

Guojiang Shen; Linan ZHU; Xiangfan Chen; Xiangjie Kong; Xiao Han; Yuqian Fu; Zihao Zhai

arxiv: 2605.18547 · v1 · pith:7SYRR3VBnew · submitted 2026-05-18 · 💻 cs.AI

Pith reviewed 2026-05-20 10:56 UTC · model grok-4.3

classification 💻 cs.AI

keywords Emotion Recognition in Conversation · Vision-Language Models · Speaker-Centered Learning · Tuning-Free Approach · Multimodal Fusion · Affective Computing · Visual Features

The pith

A speaker-centered framework uses frozen vision-language models without fine-tuning to recognize emotions in conversations by focusing on visual cues from the active speaker.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VISAFF to tackle emotion recognition in conversations by addressing the limitations of text-only methods and the high costs of fine-tuning large vision-language models. It proposes a two-stage process that first grounds the model on the active speaker's affective visuals using a tuning-free approach and then complements uncertain visuals with reliable textual and acoustic information. This method aims to efficiently leverage the reasoning capabilities of pre-trained models while avoiding focus on irrelevant background or non-active participants. Sympathetic readers would care because it promises accurate multimodal emotion detection at lower computational cost, making advanced AI more practical for real-time human-machine interactions.

Core claim

VISAFF consists of Speaker-Centered Affective Grounding, which unlocks the reasoning capabilities of frozen VLMs to focus on the active speaker's emotional visual cues, and Reliability-Guided Affective Complementation, which dynamically leverages textual and acoustic modalities to compensate for visual uncertainty, leading to highly competitive performance on real-world datasets without the need for expensive fine-tuning.
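To make the idea of reliability-guided complementation concrete, here is a minimal sketch. It is not the authors' mechanism, which this page does not specify. It assumes, purely for illustration, that each modality yields a probability vector over emotion labels and that visual reliability is measured as one minus the normalized entropy of the visual prediction. The function names `visual_reliability` and `complement` are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(K)."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def visual_reliability(visual_probs):
    """Assumed reliability score: 1 for a one-hot prediction, 0 for a uniform one."""
    return 1.0 - normalized_entropy(visual_probs)


def complement(visual_probs, text_probs, audio_probs):
    """Blend visual predictions with the mean of text and audio predictions.

    The less reliable the visual branch, the more weight the other two
    modalities receive. All inputs are probability vectors of equal length.
    """
    r = visual_reliability(visual_probs)
    fused = []
    for pv, pt, pa in zip(visual_probs, text_probs, audio_probs):
        support = 0.5 * (pt + pa)
        fused.append(r * pv + (1.0 - r) * support)
    total = sum(fused)
    return [f / total for f in fused]


if __name__ == "__main__":
    labels = ["neutral", "joy", "anger"]
    ambiguous_visual = [0.34, 0.33, 0.33]  # near-uniform, so low reliability
    text = [0.10, 0.10, 0.80]
    audio = [0.20, 0.10, 0.70]
    fused = complement(ambiguous_visual, text, audio)
    print(labels[fused.index(max(fused))])
```

With a near-uniform visual prediction the fused output follows text and audio almost entirely. With a one-hot visual prediction it returns the visual prediction unchanged.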

What carries the argument

The VISAFF framework featuring Speaker-Centered Affective Grounding to steer frozen VLMs toward the active speaker and Reliability-Guided Affective Complementation to fuse modalities based on reliability.

Load-bearing premise

That a tuning-free approach can unlock the reasoning capabilities of frozen VLMs to focus specifically on the active speaker's emotional visual cues without heavy training overheads or loss of accuracy.

What would settle it

If VISAFF performs substantially worse than fine-tuned alternatives or speaker-agnostic baselines on the evaluated datasets, or if the complementation mechanism fails to improve results when visuals are ambiguous.
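The second condition above can be checked by bucketing test samples by their initial visual confidence and comparing accuracy before and after complementation within each bucket. The sketch below shows one way to do that. The record keys (`conf`, `visual`, `fused`, `gold`), the bucket edges, and the toy data are all hypothetical and are not taken from the paper.

```python
def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def gain_by_confidence(records, edges=(0.0, 0.5, 1.0)):
    """Compare visual-only and fused accuracy inside each confidence bucket.

    Each record is a dict with hypothetical keys:
      conf   - initial visual confidence in [0, 1]
      visual - label predicted from the visual branch alone
      fused  - label predicted after complementation
      gold   - annotated label
    Returns (low, high, n, visual_acc, fused_acc) for each non-empty bucket.
    """
    rows = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        low, high = edges[i], edges[i + 1]
        # Buckets are half-open [low, high); the last one also includes high.
        bucket = [
            r for r in records
            if low <= r["conf"] < high or (i == last and r["conf"] == high)
        ]
        if not bucket:
            continue
        golds = [r["gold"] for r in bucket]
        visual_acc = accuracy([r["visual"] for r in bucket], golds)
        fused_acc = accuracy([r["fused"] for r in bucket], golds)
        rows.append((low, high, len(bucket), visual_acc, fused_acc))
    return rows


if __name__ == "__main__":
    # Toy records for illustration only, not results from the paper.
    toy = [
        {"conf": 0.2, "visual": "joy", "fused": "anger", "gold": "anger"},
        {"conf": 0.3, "visual": "neutral", "fused": "neutral", "gold": "neutral"},
        {"conf": 0.9, "visual": "joy", "fused": "joy", "gold": "joy"},
    ]
    for low, high, n, v_acc, f_acc in gain_by_confidence(toy):
        print(f"[{low:.1f}, {high:.1f}] n={n} visual={v_acc:.2f} fused={f_acc:.2f}")
```

If complementation works as claimed, the gap between fused and visual-only accuracy should be largest in the low-confidence buckets and shrink as confidence rises.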

Figures

Figures reproduced from arXiv: 2605.18547 by Guojiang Shen, Linan ZHU, Xiangfan Chen, Xiangjie Kong, Xiao Han, Yuqian Fu, Zihao Zhai.

**Figure 2.** Overall architecture of VISAFF. Stage 1, …

**Figure 3.** Qualitative visualization of visual cues described by the model.

**Figure 4.** Effect of RGAC under Different Initial Visual Confidence Levels.

Abstract

Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF utilizes a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues without heavy training overheads. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VISAFF offers a practical tuning-free way to center VLMs on active speakers for ERC plus reliability-based multimodal complementation, but the grounding mechanism needs concrete details to back the claims.

The main point is a two-stage framework that steers frozen VLMs toward the active speaker's visual cues for emotion recognition in conversation and then fills in gaps with text and audio when visuals are unreliable. This targets a genuine issue where standard VLMs drift to backgrounds or listeners and where fine-tuning is too costly for many uses. The speaker-centered grounding plus reliability-guided complementation is the specific new combination here, and it is presented as addressing gaps in prior VLM applications to ERC without parameter updates.

The efficiency angle is handled directly by keeping the VLM frozen, which is a clear practical win if the performance holds. The abstract reports competitive results on two real-world datasets, which is the part worth checking against actual numbers and baselines.

The soft spot is the lack of visible mechanism for the tuning-free speaker focus. Without a described prompt strategy, region selection, or attention adjustment, it is hard to see how the model is forced to ignore passive listeners in ambiguous frames, and that matches the stress-test concern. If the full paper includes ablations or attention maps showing the visual branch actually contributes distinct signal, that would tighten the argument; otherwise the gains may come mostly from the complementation step.

This is for researchers working on multimodal ERC or efficient VLM deployment in dialogue systems. Readers who need low-compute visual features for conversation emotion would get usable ideas from it. The work deserves peer review so the implementation, results, and code can be examined for whether the centering actually delivers as claimed.

Referee Report

2 major / 2 minor

Summary. The paper introduces VISAFF, a two-stage framework for Emotion Recognition in Conversation (ERC). The first stage, Speaker-Centered Affective Grounding, uses a tuning-free method to steer frozen Vision-Language Models (VLMs) toward extracting emotional visual features from the active speaker rather than background or passive listeners. The second stage, Reliability-Guided Affective Complementation, dynamically integrates textual and acoustic modalities to address visual uncertainty. Experiments on two real-world datasets are reported to show highly competitive performance against state-of-the-art methods while eliminating the computational cost of fine-tuning large VLMs; source code is provided.

Significance. If the tuning-free speaker-centering mechanism proves effective, the work would offer a practical advance in multimodal ERC by demonstrating that frozen VLMs can be guided to relevant visual cues without parameter updates, thereby improving efficiency and accessibility. The availability of source code supports reproducibility and allows direct verification of the claimed gains.

major comments (2)

Abstract and §3.1: The claim that a tuning-free approach 'efficiently steering them to focus on the active speaker's emotional visual cues' is central to both the efficiency gain and the competitiveness assertion, yet the manuscript provides no explicit prompt template, region proposal, attention mask, or other mechanism that would enforce speaker-specific focus in a frozen VLM. Without this detail or an ablation isolating the visual branch's contribution, it remains unclear whether the reported performance exceeds what text/acoustic complementation alone would achieve.
§4 (Experiments): The abstract states that VISAFF achieves 'highly competitive performance' on two datasets in a tuning-free setting, but no quantitative tables, baseline comparisons, or ablation results are referenced that would demonstrate the visual features are speaker-centered rather than scene-level. If the visual branch contributes little beyond the complementation stage, the efficiency advantage is overstated.

minor comments (2)

The abstract mentions 'two real-world datasets' but does not name them or cite prior ERC benchmarks; adding these references would improve context.
Notation for the reliability-guided complementation (e.g., how reliability scores are computed) should be formalized with an equation in §3.2 for clarity.
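As an illustration of the kind of formalization the second minor comment asks for, one possible form is given below. It is an assumption for discussion, not the paper's definition: $\mathbf{p}_v$, $\mathbf{p}_t$, and $\mathbf{p}_a$ denote hypothetical per-modality class distributions over $K$ emotion labels, $r_v$ is an entropy-based visual reliability score, and $\hat{\mathbf{p}}$ is the fused prediction.

```latex
H(\mathbf{p}_v) = -\sum_{k=1}^{K} p_{v,k} \log p_{v,k},
\qquad
r_v = 1 - \frac{H(\mathbf{p}_v)}{\log K},
\qquad
\hat{\mathbf{p}} = r_v\,\mathbf{p}_v + (1 - r_v)\,\tfrac{1}{2}\left(\mathbf{p}_t + \mathbf{p}_a\right)
```

Under this form $r_v = 1$ when the visual prediction is one-hot and $r_v = 0$ when it is uniform, so text and audio take over exactly when the visual branch is least informative.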

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity regarding the tuning-free speaker-centering mechanism and the supporting experimental evidence. We address each major comment point by point below, proposing revisions where they strengthen the paper without misrepresenting our contributions.

Referee: Abstract and §3.1: The claim that a tuning-free approach 'efficiently steering them to focus on the active speaker's emotional visual cues' is central to both the efficiency gain and the competitiveness assertion, yet the manuscript provides no explicit prompt template, region proposal, attention mask, or other mechanism that would enforce speaker-specific focus in a frozen VLM. Without this detail or an ablation isolating the visual branch's contribution, it remains unclear whether the reported performance exceeds what text/acoustic complementation alone would achieve.

Authors: We appreciate the referee's emphasis on this central claim. In §3.1, the Speaker-Centered Affective Grounding stage is described as using a tuning-free prompting strategy on frozen VLMs that incorporates active speaker identification to direct focus toward relevant emotional visual cues rather than background or passive listeners. To enhance clarity and address the concern directly, we will include the exact prompt template in the revised manuscript, along with any supporting details on how speaker cues are integrated (e.g., via textual descriptions of speaker regions). We will also add an ablation study isolating the visual branch by comparing the full model against a variant relying solely on the reliability-guided complementation stage. This will demonstrate the incremental contribution of the speaker-centered visual features.
Revision: yes
Referee: §4 (Experiments): The abstract states that VISAFF achieves 'highly competitive performance' on two datasets in a tuning-free setting, but no quantitative tables, baseline comparisons, or ablation results are referenced that would demonstrate the visual features are speaker-centered rather than scene-level. If the visual branch contributes little beyond the complementation stage, the efficiency advantage is overstated.

Authors: We acknowledge the need for more explicit linkage between the claims and the experimental evidence. Section 4 reports results on two real-world datasets with comparisons to state-of-the-art methods, showing competitive performance in the tuning-free setting, and the source code is provided for verification. To directly respond to the referee's point, we will revise §4 to include clearer references to the quantitative tables and add ablation experiments that contrast speaker-centered visual features against scene-level alternatives. These additions will substantiate that the speaker-centering mechanism provides meaningful gains beyond complementation alone, thereby supporting rather than overstating the efficiency advantages of avoiding VLM fine-tuning.
Revision: yes
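The first response mentions integrating speaker cues "via textual descriptions of speaker regions". The sketch below shows what such a tuning-free prompt could look like. It is a hypothetical illustration, not the authors' template: the `SpeakerBox` type, the three-by-three region grid, and the prompt wording are all assumptions, and the bounding box is assumed to come from an external active-speaker detector.

```python
from dataclasses import dataclass


@dataclass
class SpeakerBox:
    """Bounding box of the active speaker in normalized (0..1) image coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def describe_region(box):
    """Map the box center to a coarse region name on a 3x3 grid."""
    cx = (box.x_min + box.x_max) / 2
    cy = (box.y_min + box.y_max) / 2
    horiz = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    vert = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    return f"{vert} {horiz}"


# Hypothetical wording; the paper's actual template is not shown on this page.
PROMPT_TEMPLATE = (
    "The person currently speaking is {speaker}, located in the {region} "
    "part of the frame. They say: \"{utterance}\"\n"
    "Describe only this person's facial expression, gaze, and body posture. "
    "Ignore the background and anyone who is not speaking."
)


def build_prompt(speaker, utterance, box):
    """Fill the template with the speaker name, utterance, and region text."""
    return PROMPT_TEMPLATE.format(
        speaker=speaker,
        region=describe_region(box),
        utterance=utterance,
    )


if __name__ == "__main__":
    box = SpeakerBox(0.05, 0.10, 0.30, 0.45)
    print(build_prompt("Speaker A", "Oh, great, another meeting.", box))
```

Because the VLM stays frozen, the only lever in a design like this is the prompt text, so an ablation could swap `build_prompt` for a scene-level prompt that omits the speaker and region fields.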

Circularity Check

0 steps flagged

VISAFF framework is a new construction with no load-bearing reductions to fitted inputs or self-citations

The paper introduces VISAFF as a two-stage framework (Speaker-Centered Affective Grounding followed by Reliability-Guided Affective Complementation) that applies a tuning-free method to frozen VLMs. Performance claims rest on experimental results from two real-world datasets rather than any derivation, equation, or parameter fit that reduces the claimed outcomes directly to prior inputs by construction. No self-definitional steps, fitted-input predictions, or uniqueness theorems imported from the authors' own prior work appear in the derivation chain. The central methodological choices remain independent of the reported results, making this a standard non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unverified ability of frozen VLMs to be steered toward speaker-specific emotional cues and on the effectiveness of the new complementation stage, both introduced without independent evidence beyond the performance claim.

axioms (1)

Domain assumption: Frozen VLMs can be steered via a tuning-free approach to focus on the active speaker's emotional visual cues.
This premise underpins the entire first stage and is stated as the solution to prior VLM limitations in the abstract.

invented entities (2)

Speaker-Centered Affective Grounding (no independent evidence)
Purpose: To direct frozen VLMs toward active speaker emotional cues.
New stage introduced to solve background and listener focus problems.

Reliability-Guided Affective Complementation (no independent evidence)
Purpose: To dynamically compensate visual uncertainty using text and audio.
New mechanism for multimodal fusion when visual signals are ambiguous.


Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNL...

work page 2019

[11] [11]

Dialoguemmt: Dialogue scenes understanding enhanced multi-modal multi-task tuning for emotion recognition in conversations

Chenyuan He, Senbin Zhu, Hongde Liu, Fei Gao, Yuxiang Jia, Hongying Zan, and Min Peng. Dialoguemmt: Dialogue scenes understanding enhanced multi-modal multi-task tuning for emotion recognition in conversations. InProceedings of the 31st International Conference on Computational Linguistics, pages 2497–2512, 2025

work page 2025

[12] [12]

Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations

Dou Hu, Xiaolong Hou, Lingwei Wei, Lianxin Jiang, and Yang Mo. Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7037–7041. IEEE, 2022

work page 2022

[13] [13]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 10

work page 2022

[14] [14]

Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations

Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7360–7370, 2020

work page 2020

[15] [15]

Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

work page 2020

[16] [16]

Emoberta: Speaker-aware emotion recognition in conversation with roberta

Taewoon Kim and Piek V ossen. Emoberta: Speaker-aware emotion recognition in conversation with roberta. arxiv 2021.arXiv preprint arXiv:2108.12009, 2021

work page arXiv 2021

[17] [17]

CoMPM: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation

Joosung Lee and Wooin Lee. CoMPM: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5669–5679, 2022

work page 2022

[18] [18]

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

CTNet: Conversational transformer network for emotion recognition.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 985–1000, 2021

Zheng Lian, Bin Liu, and Jianhua Tao. CTNet: Conversational transformer network for emotion recognition.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 985–1000, 2021

work page 2021

[20] [20]

A transformer-based model with self-distillation for multimodal emotion recognition in conversations.IEEE Trans- actions on Multimedia, 26:776–788, 2023

Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, and Bo Xu. A transformer-based model with self-distillation for multimodal emotion recognition in conversations.IEEE Trans- actions on Multimedia, 26:776–788, 2023

work page 2023

[21] [21]

emotion2vec: Self-supervised pre-training for speech emotion representation

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 15747–15760, 2024

work page 2024

[22] [22]

DialogueRNN: An attentive RNN for emotion detection in conversations

Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6818–6825, 2019

work page 2019

[23] [23]

Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words

Saif Mohammad. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words. InProceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 174–184, 2018

work page 2018

[24] [24]

Omnivox: Zero-shot emotion recognition with omni-llms

John Murzaku and Owen Rambow. Omnivox: Zero-shot emotion recognition with omni-llms. arXiv preprint arXiv:2503.21480, 2025

work page arXiv 2025

[25] [25]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[26] [26]

Context-dependent sentiment analysis in user-generated videos

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 873–883, 2017

work page 2017

[27] [27]

Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019

work page 2019

[28] [28]

A multimodal corpus for emotion recognition in sarcasm

Anupama Ray, Shubham Mishra, Apoorva Nunna, and Pushpak Bhattacharyya. A multimodal corpus for emotion recognition in sarcasm. InProceedings of the thirteenth language resources and evaluation conference, pages 6992–7003, 2022. 11

work page 2022

[29] [29]

Lr-gcn: Latent relation- aware graph convolutional network for conversational emotion recognition.IEEE Transactions on Multimedia, 24:4422–4432, 2021

Minjie Ren, Xiangdong Huang, Wenhui Li, Dan Song, and Weizhi Nie. Lr-gcn: Latent relation- aware graph convolutional network for conversational emotion recognition.IEEE Transactions on Multimedia, 24:4422–4432, 2021

work page 2021

[30] [30]

Ava active speaker: An audio-visual dataset for active speaker detection

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, et al. Ava active speaker: An audio-visual dataset for active speaker detection. InICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 449...

work page 2020

[31] [31]

Multiemo: An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations

Tao Shi and Shao-Lun Huang. Multiemo: An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14752–14766, 2023

work page 2023

[32] [32]

Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection

Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, and Haizhou Li. Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection. InProceedings of the 29th ACM international conference on multimedia, pages 3927–3935, 2021

work page 2021

[33] [33]

Adaptive graph learning for multimodal conversational emotion detection.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19089–19097, 2024

Geng Tu, Tian Xie, Bin Liang, Hongpeng Wang, and Ruifeng Xu. Adaptive graph learning for multimodal conversational emotion detection.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19089–19097, 2024

work page 2024

[34] [34]

Fera 2017-addressing head pose in the third facial expression recognition and analysis challenge

Michel F Valstar, Enrique Sánchez-Lozano, Jeffrey F Cohn, László A Jeni, Jeffrey M Girard, Zheng Zhang, Lijun Yin, and Maja Pantic. Fera 2017-addressing head pose in the third facial expression recognition and analysis challenge. In2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 839–847. IEEE, 2017

work page 2017

[35] [35]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Beyond silent letters: Amplifying llms in emotion recognition with vocal nuances

Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. Beyond silent letters: Amplifying llms in emotion recognition with vocal nuances. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2202–2218, 2025

work page 2025

[37] [37]

Multimodal fusion via hypergraph autoencoder and contrastive learning for emotion recognition in conversation

Zijian Yi, Ziming Zhao, Zhishu Shen, and Tiehua Zhang. Multimodal fusion via hypergraph autoencoder and contrastive learning for emotion recognition in conversation. InProceedings of the 32nd ACM International Conference on Multimedia, pages 4341–4348, 2024. doi: 10.1145/3664647.3681633

work page doi:10.1145/3664647.3681633 2024

[38] [38]

ECERC: Evidence-cause attention network for multi-modal emotion recognition in conversation

Tao Zhang and Zhenhua Tan. ECERC: Evidence-cause attention network for multi-modal emotion recognition in conversation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2064–2077, Vienna, Austria, Ju...

work page 2064

[39] [39]

Dialoguellm: Context and emotion knowledge-tuned large language models for emotion recognition in conversations.Neural Networks, page 107901, 2025

Yazhou Zhang, Mengyao Wang, Youxi Wu, Prayag Tiwari, Qiuchi Li, Benyou Wang, and Jing Qin. Dialoguellm: Context and emotion knowledge-tuned large language models for emotion recognition in conversations.Neural Networks, page 107901, 2025

work page 2025

[40] [40]

Unicon: Unified context network for robust active speaker detection

Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, and Xilin Chen. Unicon: Unified context network for robust active speaker detection. InProceedings of the 29th ACM international conference on multimedia, pages 3964–3972, 2021. 12

work page 2021