arxiv: 2605.18547 · v1 · pith:7SYRR3VBnew · submitted 2026-05-18 · 💻 cs.AI

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

Pith reviewed 2026-05-20 10:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords Emotion Recognition in Conversation · Vision-Language Models · Speaker-Centered Learning · Tuning-Free Approach · Multimodal Fusion · Affective Computing · Visual Features

The pith

A speaker-centered framework uses frozen vision-language models without fine-tuning to recognize emotions in conversations by focusing on visual cues from the active speaker.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VISAFF for emotion recognition in conversation, targeting two problems: text-only methods miss non-verbal cues, and fine-tuning large vision-language models is costly. It proposes a two-stage process that first grounds a frozen model on the active speaker's affective visual cues using a tuning-free approach, then complements uncertain visual evidence with more reliable textual and acoustic information. The aim is to use the reasoning capabilities of pre-trained models efficiently while keeping them from attending to irrelevant background or non-active participants. Sympathetic readers would care because it promises accurate multimodal emotion detection at lower computational cost, which would make such systems more practical for real-time human-machine interaction.

Core claim

VISAFF has two stages. Speaker-Centered Affective Grounding steers a frozen VLM, without any parameter updates, toward the active speaker's emotional visual cues. Reliability-Guided Affective Complementation then draws on the textual and acoustic modalities to compensate when the visual signal is uncertain. The authors report highly competitive performance on two real-world datasets without expensive fine-tuning.

What carries the argument

The VISAFF framework featuring Speaker-Centered Affective Grounding to steer frozen VLMs toward the active speaker and Reliability-Guided Affective Complementation to fuse modalities based on reliability.
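
To make the idea of reliability-based fusion concrete, here is a minimal Python sketch. It is not the authors' method: this page does not state how VISAFF computes reliability, so the entropy-based confidence score, the equal weighting of text and audio, and the function names `visual_reliability` and `fuse` are all assumptions chosen for illustration.

```python
import math


def visual_reliability(probs):
    """Assumed reliability score in [0, 1]: one minus normalized Shannon entropy.

    A peaked distribution scores near 1; a uniform one scores near 0.
    """
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(k)


def fuse(visual, text, audio):
    """Blend visual class probabilities with the text/audio average.

    The visual branch gets weight r (its reliability); the remaining
    weight 1 - r goes to the mean of the text and audio distributions.
    """
    r = visual_reliability(visual)
    fused = [
        r * v + (1.0 - r) * 0.5 * (t + a)
        for v, t, a in zip(visual, text, audio)
    ]
    total = sum(fused)
    return [f / total for f in fused]


if __name__ == "__main__":
    # Ambiguous visual evidence: the result follows text and audio.
    print(fuse([0.25, 0.25, 0.25, 0.25],
               [0.7, 0.1, 0.1, 0.1],
               [0.5, 0.3, 0.1, 0.1]))
```

Under these assumptions, a visual prediction that is spread evenly across classes contributes almost nothing, while a confident one dominates the fused output. A learned gate or a calibration-based score would be equally plausible choices.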

Load-bearing premise

That a tuning-free approach can unlock the reasoning capabilities of frozen VLMs to focus specifically on the active speaker's emotional visual cues without heavy training overheads or loss of accuracy.

What would settle it

If VISAFF performs substantially worse than fine-tuned alternatives or speaker-agnostic baselines on the evaluated datasets, or if the complementation mechanism fails to improve results when visuals are ambiguous.

Figures

Figures reproduced from arXiv: 2605.18547 by Guojiang Shen, Linan ZHU, Xiangfan Chen, Xiangjie Kong, Xiao Han, Yuqian Fu, Zihao Zhai.

Figure 1: Motivation of speaker-centered visual affective feature learning for ERC. (a) [...]
Figure 2: Overall architecture of VISAFF. Stage 1, [...]
Figure 3: Qualitative visualization of visual cues described by the model.
Figure 4: Effect of RGAC under Different Initial Visual Confidence Levels.
Original abstract

Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF utilizes a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues without heavy training overheads. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VISAFF, a two-stage framework for Emotion Recognition in Conversation (ERC). The first stage, Speaker-Centered Affective Grounding, uses a tuning-free method to steer frozen Vision-Language Models (VLMs) toward extracting emotional visual features from the active speaker rather than background or passive listeners. The second stage, Reliability-Guided Affective Complementation, dynamically integrates textual and acoustic modalities to address visual uncertainty. Experiments on two real-world datasets are reported to show highly competitive performance against state-of-the-art methods while eliminating the computational cost of fine-tuning large VLMs; source code is provided.

Significance. If the tuning-free speaker-centering mechanism proves effective, the work would offer a practical advance in multimodal ERC by demonstrating that frozen VLMs can be guided to relevant visual cues without parameter updates, thereby improving efficiency and accessibility. The availability of source code supports reproducibility and allows direct verification of the claimed gains.

major comments (2)
  1. Abstract and §3.1: The claim that a tuning-free approach 'efficiently steering them to focus on the active speaker's emotional visual cues' is central to both the efficiency gain and the competitiveness assertion, yet the manuscript provides no explicit prompt template, region proposal, attention mask, or other mechanism that would enforce speaker-specific focus in a frozen VLM. Without this detail or an ablation isolating the visual branch's contribution, it remains unclear whether the reported performance exceeds what text/acoustic complementation alone would achieve.
  2. §4 (Experiments): The abstract states that VISAFF achieves 'highly competitive performance' on two datasets in a tuning-free setting, but no quantitative tables, baseline comparisons, or ablation results are referenced that would demonstrate the visual features are speaker-centered rather than scene-level. If the visual branch contributes little beyond the complementation stage, the efficiency advantage is overstated.
minor comments (2)
  1. The abstract mentions 'two real-world datasets' but does not name them or cite prior ERC benchmarks; adding these references would improve context.
  2. Notation for the reliability-guided complementation (e.g., how reliability scores are computed) should be formalized with an equation in §3.2 for clarity.
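
As an illustration of the kind of formalization the second minor comment asks for, one possible form is given below. It is a sketch under stated assumptions, not the paper's equation: the symbols $p^{v}$, $p^{t}$, $p^{a}$ (visual, textual, and acoustic class distributions over $K$ emotions), the entropy-based reliability $r$, and the equal text/audio weighting are all hypothetical.

```latex
% Assumed reliability: one minus the normalized entropy of the visual prediction
r = 1 - \frac{H(p^{v})}{\log K},
\qquad
H(p^{v}) = -\sum_{k=1}^{K} p^{v}_{k} \log p^{v}_{k}

% Assumed fusion: the visual branch is trusted in proportion to r
\hat{p} = r \, p^{v} + (1 - r) \, \tfrac{1}{2} \left( p^{t} + p^{a} \right)
```

With this form, $r = 0$ for a uniform visual prediction, so $\hat{p}$ reduces to the text/audio average, and $r = 1$ for a one-hot visual prediction, so $\hat{p} = p^{v}$.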

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity regarding the tuning-free speaker-centering mechanism and the supporting experimental evidence. We address each major comment point by point below, proposing revisions where they strengthen the paper without misrepresenting our contributions.

  1. Referee: Abstract and §3.1: The claim that a tuning-free approach 'efficiently steering them to focus on the active speaker's emotional visual cues' is central to both the efficiency gain and the competitiveness assertion, yet the manuscript provides no explicit prompt template, region proposal, attention mask, or other mechanism that would enforce speaker-specific focus in a frozen VLM. Without this detail or an ablation isolating the visual branch's contribution, it remains unclear whether the reported performance exceeds what text/acoustic complementation alone would achieve.

    Authors: We appreciate the referee's emphasis on this central claim. In §3.1, the Speaker-Centered Affective Grounding stage is described as using a tuning-free prompting strategy on frozen VLMs that incorporates active speaker identification to direct focus toward relevant emotional visual cues rather than background or passive listeners. To enhance clarity and address the concern directly, we will include the exact prompt template in the revised manuscript, along with any supporting details on how speaker cues are integrated (e.g., via textual descriptions of speaker regions). We will also add an ablation study isolating the visual branch by comparing the full model against a variant relying solely on the reliability-guided complementation stage. This will demonstrate the incremental contribution of the speaker-centered visual features. revision: yes

  2. Referee: §4 (Experiments): The abstract states that VISAFF achieves 'highly competitive performance' on two datasets in a tuning-free setting, but no quantitative tables, baseline comparisons, or ablation results are referenced that would demonstrate the visual features are speaker-centered rather than scene-level. If the visual branch contributes little beyond the complementation stage, the efficiency advantage is overstated.

    Authors: We acknowledge the need for more explicit linkage between the claims and the experimental evidence. Section 4 reports results on two real-world datasets with comparisons to state-of-the-art methods, showing competitive performance in the tuning-free setting, and the source code is provided for verification. To directly respond to the referee's point, we will revise §4 to include clearer references to the quantitative tables and add ablation experiments that contrast speaker-centered visual features against scene-level alternatives. These additions will substantiate that the speaker-centering mechanism provides meaningful gains beyond complementation alone, thereby supporting rather than overstating the efficiency advantages of avoiding VLM fine-tuning. revision: yes
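
The response to the first point mentions integrating speaker cues "via textual descriptions of speaker regions". The Python sketch below shows what such a prompt builder might look like. Everything in it is hypothetical: the function name `build_speaker_prompt`, the normalized bounding-box convention, and the prompt wording are illustrative assumptions, since the actual template is not given on this page.

```python
def build_speaker_prompt(speaker_name, utterance, box, labels):
    """Return a text prompt that points a frozen VLM at one speaker.

    box is an assumed (x1, y1, x2, y2) tuple in normalized [0, 1]
    image coordinates, e.g. from an active speaker detector.
    """
    x1, y1, x2, y2 = box
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError("box must be normalized with x1 < x2 and y1 < y2")
    region = f"({x1:.2f}, {y1:.2f}) to ({x2:.2f}, {y2:.2f})"
    return (
        f"The active speaker is {speaker_name}, located in the region {region} "
        f"of the frame. They are saying: \"{utterance}\". "
        "Ignore the background and any listeners. "
        "Describe only this speaker's facial expression, gaze, and posture, "
        f"then choose one label from: {', '.join(labels)}."
    )


if __name__ == "__main__":
    print(build_speaker_prompt(
        "Speaker A",
        "Oh, great. Another meeting.",
        (0.10, 0.20, 0.50, 0.90),
        ["neutral", "joy", "anger"],
    ))
```

Because the model's weights stay frozen, any speaker focus in a design like this comes entirely from the prompt text, which is why the referee's request for the exact template and a matching ablation matters.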

Circularity Check

0 steps flagged

VISAFF framework is a new construction with no load-bearing reductions to fitted inputs or self-citations

The paper introduces VISAFF as a two-stage framework (Speaker-Centered Affective Grounding followed by Reliability-Guided Affective Complementation) that applies a tuning-free method to frozen VLMs. Performance claims rest on experimental results from two real-world datasets rather than any derivation, equation, or parameter fit that reduces the claimed outcomes directly to prior inputs by construction. No self-definitional steps, fitted-input predictions, or uniqueness theorems imported from the authors' own prior work appear in the derivation chain. The central methodological choices remain independent of the reported results, making this a standard non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unverified ability of frozen VLMs to be steered toward speaker-specific emotional cues and on the effectiveness of the new complementation stage, both introduced without independent evidence beyond the performance claim.

axioms (1)
  • Domain assumption: frozen VLMs can be steered via a tuning-free approach to focus on the active speaker's emotional visual cues.
    This premise underpins the entire first stage and is stated as the solution to prior VLM limitations in the abstract.
invented entities (2)
  • Speaker-Centered Affective Grounding (no independent evidence)
    purpose: To direct frozen VLMs toward active speaker emotional cues
    New stage introduced to solve background and listener focus problems.
  • Reliability-Guided Affective Complementation (no independent evidence)
    purpose: To dynamically compensate visual uncertainty using text and audio
    New mechanism for multimodal fusion when visual signals are ambiguous.

pith-pipeline@v0.9.0 · 5821 in / 1335 out tokens · 56645 ms · 2026-05-20T10:56:03.828884+00:00 · methodology
