MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

Devamanyu Hazarika; Erik Cambria; Gautam Naik; Navonil Majumder; Rada Mihalcea; Soujanya Poria

arxiv: 1810.02508 · v6 · pith:GAU2XMPJ · submitted 2018-10-05 · 💻 cs.CL

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, Rada Mihalcea

classification 💻 cs.CL

keywords multimodal · emotion · conversations · dataset · meld · recognition · emotionlines · multi-party

Emotion recognition in conversations is a challenging task that has recently gained popularity due to its potential applications. Until now, however, a large-scale multimodal multi-party emotional conversational database containing more than two speakers per dialogue was missing. Thus, we propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD contains about 13,000 utterances from 1,433 dialogues from the TV-series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for emotion recognition in conversations. The full dataset is available for use at http://affective-meld.github.io.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues
cs.SD 2026-05 unverdicted novelty 7.0

ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
cs.CV 2026-04 unverdicted novelty 7.0

C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even f...

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD 2025-07 unverdicted novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
cs.CV 2025-06 conditional novelty 7.0

SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.

Deep Multimodal Learning with Missing Modality: A Survey
cs.CV 2024-09 unverdicted novelty 7.0

This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.

Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis
cs.CV 2026-04 unverdicted novelty 6.0

A 20TB multimodal dyadic corpus with face video, thermal dynamics, voice, physiology, and stance annotations for 45 interactions enables new social signal modeling.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
cs.CL 2025-09 unverdicted novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

Benchmarking Gaslighting Attacks Against Speech Large Language Models
cs.CL 2025-09 unverdicted novelty 6.0

Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies...

Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...