Multi-attention Recurrent Network for Human Communication Comprehension

Amir Zadeh; Erik Cambria; Louis-Philippe Morency; Paul Pu Liang; Prateek Vij; Soujanya Poria

arxiv: 1802.00923 · v1 · pith:QFQAM2ATnew · submitted 2018-02-03 · 💻 cs.AI · cs.CL · cs.LG

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, Louis-Philippe Morency

classification 💻 cs.AI · cs.CL · cs.LG

keywords communication · human · modality · called · multi-attention · recurrent · component · datasets

Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality) and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication; however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape human communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
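
The abstract names two components but gives no equations, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes one LSTM-style cell per modality whose gates also read a shared cross-modal code `z` (standing in for the hybrid memory), and a block that applies `K` softmax attentions over the concatenated hidden states to produce the next `z`. Every name, dimension, weight shape and the value of `K` below is a hypothetical choice made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def multi_attention_block(h_cat, W_att, W_out):
    """Assumed form of a multi-attention step.
    h_cat: (d,) concatenated hidden states of all modalities.
    W_att: (K, d, d) one scoring matrix per attention.
    W_out: (K*d, z_dim) projection to the cross-modal code.
    """
    scores = np.einsum("kij,j->ki", W_att, h_cat)  # (K, d)
    attn = softmax(scores, axis=1)                 # each attention sums to 1
    attended = attn * h_cat                        # (K, d) reweighted states
    z = np.tanh(attended.reshape(-1) @ W_out)      # (z_dim,)
    return attn, z

def hybrid_memory_step(x, h_prev, c_prev, z_prev, W, U, V, b):
    """LSTM-style update whose gates also see the previous code z_prev
    (an assumption about how the hybrid memory is used)."""
    gates = W @ x + U @ h_prev + V @ z_prev + b
    i, f, o, g = np.split(gates, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def init_cell(in_dim, hid_dim, z_dim, rng, scale=0.1):
    # Random weights; a real model would learn these.
    return {
        "W": rng.normal(0, scale, (4 * hid_dim, in_dim)),
        "U": rng.normal(0, scale, (4 * hid_dim, hid_dim)),
        "V": rng.normal(0, scale, (4 * hid_dim, z_dim)),
        "b": np.zeros(4 * hid_dim),
    }

def run_sketch(T=3, seed=0):
    rng = np.random.default_rng(seed)
    in_dims = {"language": 5, "vision": 4, "acoustic": 3}   # hypothetical sizes
    hid_dims = {"language": 4, "vision": 3, "acoustic": 2}  # hypothetical sizes
    K, z_dim = 2, 6
    d = sum(hid_dims.values())
    cells = {m: init_cell(in_dims[m], hid_dims[m], z_dim, rng) for m in in_dims}
    W_att = rng.normal(0, 0.1, (K, d, d))
    W_out = rng.normal(0, 0.1, (K * d, z_dim))
    h = {m: np.zeros(hid_dims[m]) for m in in_dims}
    c = {m: np.zeros(hid_dims[m]) for m in in_dims}
    z = np.zeros(z_dim)
    attn = None
    for _ in range(T):
        for m in in_dims:
            x = rng.normal(size=in_dims[m])  # random stand-in for real features
            h[m], c[m] = hybrid_memory_step(x, h[m], c[m], z, **cells[m])
        h_cat = np.concatenate([h[m] for m in in_dims])
        attn, z = multi_attention_block(h_cat, W_att, W_out)
    return attn, z

attn, z = run_sketch()
```

Here `attn` holds the `K` attention distributions from the last time step and `z` is the final cross-modal code; in a trained model, `z` together with the per-modality hidden states would presumably feed a task-specific prediction head.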

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis
cs.IR · 2019-07 · unverdicted · novelty 3.0

One-step DCCA fusing BERT text with audio and video embeddings outperforms prior multi-modal methods for sentiment classification on two benchmarks and a new Debate Emotion dataset.