arxiv: 2503.21806 · v1 · pith:75QZAPZJnew · submitted 2025-03-25 · 💻 cs.CL · cs.AI

Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

classification 💻 cs.CL cs.AI
keywords: speech, emotion, multilingual, recognition, zero-shot, languages, across, contrastive

Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
