arXiv: 1709.07200 · v1 · pith:TMH4CFCXnew · submitted 2017-09-21 · 💻 cs.CV · cs.LG · cs.MM

Temporal Multimodal Fusion for Video Emotion Classification in the Wild

classification 💻 cs.CV cs.LG cs.MM
keywords emotion, labels, classification, describing, features, fusion, multimodal, novel

This paper addresses emotion classification in video. The task is to predict the emotion labels, taken from a set of possible labels, that best describe the emotions contained in short video clips. Building on a standard framework, in which videos are described by audio and visual features that a supervised classifier uses to infer the labels, the paper investigates several novel directions. First, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, the different stages of the pipeline were carefully reviewed and a CNN architecture adapted to the task was designed; this is important because the training set is small relative to the difficulty of the problem, which makes generalization difficult. The resulting model ranked 4th at the 2017 Emotion in the Wild challenge with an accuracy of 58.8%.
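
To make the idea of score-level multimodal fusion concrete, the following is a minimal sketch of generic weighted late fusion, in which per-modality class scores are turned into probabilities and averaged. It is not the paper's hierarchical method, which the abstract does not specify; the function names, the `audio_weight` parameter, and the three-label set are all hypothetical choices made for illustration.

```python
import math

# Hypothetical label set for illustration only; the abstract does not list the labels.
LABELS = ["happy", "sad", "neutral"]

def softmax(scores):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scores(audio_scores, visual_scores, audio_weight=0.3):
    """Weighted average of per-modality class probabilities (generic late fusion).

    audio_weight is an assumed hyperparameter in [0, 1]; the visual modality
    receives the remaining weight. Returns the fused probabilities and the
    index of the most probable class.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("both modalities must score the same set of labels")
    probs_audio = softmax(audio_scores)
    probs_visual = softmax(visual_scores)
    fused = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(probs_audio, probs_visual)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best

if __name__ == "__main__":
    # Made-up scores for one clip: audio is undecided, visual favors the first label.
    fused, best = fuse_scores([0.0, 0.0, 0.0], [2.0, 0.0, 0.0])
    print(LABELS[best], [round(p, 3) for p in fused])
```

In a sketch like this, the weight would typically be tuned on a validation set. A feature-level alternative would instead concatenate the audio and visual descriptors before a single classifier.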
