
arxiv: 1802.08332 · v1 · pith:R3FWYT23new · submitted 2018-02-22 · 💻 cs.CL

Deep Multimodal Learning for Emotion Recognition in Spoken Language

classification 💻 cs.CL
keywords deep, features, multimodal, audio, emotion, framework, high-level, information

In this paper, we present a novel deep multimodal framework to predict human emotions from sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts high-level features from both text and audio via a hybrid deep multimodal structure, which considers the spatial information from text, the temporal information from audio, and high-level associations from low-level handcrafted features. Second, we fuse all features using a three-layer deep neural network to learn the correlations across modalities, and we train the feature extraction and fusion modules together, allowing optimal global fine-tuning of the entire structure. We evaluated the proposed framework on the IEMOCAP dataset. Our results show promising performance, achieving 60.4% weighted accuracy across five emotion categories.
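
The fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that all features are fused by a three-layer deep neural network, so the concatenation of the three feature vectors, the layer sizes, the ReLU activations, and every function name below are assumptions made for illustration. The weights are random and untrained, and the joint training of extraction and fusion modules is not shown.

```python
import math
import random


def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(v):
    return max(0.0, v)


def identity(v):
    return v


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(text_feat, audio_feat, handcrafted_feat, layers):
    """Concatenate the three feature vectors (an assumption) and pass
    them through a three-layer network ending in a softmax."""
    fused = text_feat + audio_feat + handcrafted_feat  # list concatenation
    h = dense(fused, *layers[0], relu)
    h = dense(h, *layers[1], relu)
    logits = dense(h, *layers[2], identity)
    return softmax(logits)


# Hypothetical feature sizes: 8 (text), 6 (audio), 4 (handcrafted).
rng = random.Random(0)
text_feat = [rng.random() for _ in range(8)]
audio_feat = [rng.random() for _ in range(6)]
handcrafted_feat = [rng.random() for _ in range(4)]

# Three layers: 18 -> 16 -> 8 -> 5 (five emotion categories, per the abstract).
layers = [init_layer(rng, 18, 16), init_layer(rng, 16, 8), init_layer(rng, 8, 5)]

probs = fuse_and_classify(text_feat, audio_feat, handcrafted_feat, layers)
print(len(probs), round(sum(probs), 6))
```

The output is a probability distribution over five classes, matching the five emotion categories mentioned in the abstract. In a trained system the three input vectors would come from learned text, audio, and handcrafted-feature extractors rather than random numbers.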

