Emotion Recognition Using Fusion of Audio and Video Features

Alessandro L. Koerich; Juan D. S. Ortega; Patrick Cardinal

arXiv: 1906.10623 · v1 · pith:W3RA2TFKnew · submitted 2019-06-25 · cs.LG · cs.SD · eess.AS · stat.ML

Pith reviewed 2026-05-25 16:23 UTC · model grok-4.3

keywords: emotion recognition, multimodal fusion, audio features, video features, arousal valence, RECOLA dataset, support vector regression, continuous emotion

The pith

Combining pre-trained video features with audio descriptors via fusion improves prediction of arousal and valence in spontaneous emotions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes fusing visual features extracted from a pre-trained CNN with a set of minimalistic audio parameters like prosodic and spectral descriptors to recognize continuous emotions. This fusion is done either at the feature level or the prediction level using support vector regressors. On the RECOLA dataset of spontaneous emotions, the method achieves concordance correlation coefficients of 0.749 for arousal and 0.565 for valence. A sympathetic reader would care because accurate emotion recognition from audio and video could enable better human-computer interaction and mental health monitoring. The approach highlights how complementary information from two modalities can enhance performance over single modalities.

Core claim

Fusing visual features from a pre-trained convolutional neural network with a minimalistic set of auditory descriptors, at either the feature level or the prediction level using support vector regressors, exploits complementary information and yields CCCs of 0.749 (arousal) and 0.565 (valence) on the RECOLA dataset once preprocessing and post-processing are applied.
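Because every headline number here is a concordance correlation coefficient, a minimal sketch of the metric may help. It assumes the usual definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (1/N) statistics. The function name `ccc` and the handling of the degenerate constant case are illustrative choices, not taken from the paper.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    # Population (1/N) variances and covariance.
    var_t = fmean((t - mu_t) ** 2 for t in y_true)
    var_p = fmean((p - mu_p) ** 2 for p in y_pred)
    cov = fmean((t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred))
    denom = var_t + var_p + (mu_t - mu_p) ** 2
    # Convention chosen for this sketch: two identical constant series agree.
    return 2.0 * cov / denom if denom else 1.0


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0]
    print(ccc(gold, [1.0, 2.0, 3.0]))  # identical traces
    print(ccc(gold, [2.0, 3.0, 4.0]))  # same shape, constant offset
```

Unlike Pearson correlation, the second call is penalized for the constant offset between prediction and annotation, which is one reason scaling and centering steps in post-processing can move CCC noticeably.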

What carries the argument

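Either feature-level fusion (CNN-extracted video features concatenated with prosodic, excitation, vocal tract, and spectral audio descriptors and fed to a single SVR) or prediction-level fusion (one SVR per modality, with the predictions combined afterwards).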
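The two fusion points can be sketched as plain data-flow, leaving the regressor out. This is an illustration under stated assumptions: the helper names `early_fusion` and `late_fusion` are hypothetical, frames are assumed to be already time-aligned across modalities, and the weighted average used for late fusion is a stand-in, since the abstract does not say how the per-modality predictions are combined.

```python
def early_fusion(video_feats, audio_feats):
    """Feature-level fusion: concatenate per-frame video and audio vectors.

    The result would be the input matrix of a single regressor.
    """
    if len(video_feats) != len(audio_feats):
        raise ValueError("modalities must be aligned to the same frames")
    return [list(v) + list(a) for v, a in zip(video_feats, audio_feats)]


def late_fusion(video_preds, audio_preds, w_video=0.5):
    """Prediction-level fusion: combine outputs of two per-modality regressors.

    A weighted average is an assumption made for this sketch.
    """
    if len(video_preds) != len(audio_preds):
        raise ValueError("prediction sequences must have equal length")
    if not 0.0 <= w_video <= 1.0:
        raise ValueError("w_video must lie in [0, 1]")
    return [w_video * pv + (1.0 - w_video) * pa
            for pv, pa in zip(video_preds, audio_preds)]


if __name__ == "__main__":
    # Two frames: 2-D toy video features and 1-D toy audio features.
    print(early_fusion([[1, 2], [3, 4]], [[9], [8]]))
    # Toy arousal predictions from a video model and an audio model.
    print(late_fusion([0.5, 1.0], [0.0, 0.5], w_video=0.5))
```

Early fusion lets one model learn cross-modal interactions but requires aligned frames and inflates dimensionality; late fusion keeps the modalities independent and adds a combination weight as one more parameter to tune.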

If this is right

Visual and auditory modalities provide complementary information that improves emotion prediction.
The fusion approach yields a higher CCC for arousal (0.749) than for valence (0.565).
Preprocessing and post-processing techniques contribute to better concordance correlation coefficients.
Both a single SVR trained on fused features and separate per-modality SVRs work for the task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar fusion might improve results on other emotion datasets with different recording conditions.
Real-time applications could benefit if the method is optimized for speed.
The choice of pre-trained CNN suggests transfer learning is key for visual emotion cues.

Load-bearing premise

The chosen pre-trained CNN features and minimalistic audio descriptors capture emotional content such that their fusion improves over single-modality baselines on spontaneous data.

What would settle it

A direct comparison showing that either video-only or audio-only features achieve equal or higher CCC than the fused model on the RECOLA dataset would falsify the benefit of fusion.

Figures

Figures reproduced from arXiv: 1906.10623 by Alessandro L. Koerich, Juan D. S. Ortega, Patrick Cardinal.

**Figure 2.** Using the fine-tuned CNN to extract features and train …

**Figure 3.** The CNN architecture for the FER dataset with three convolutional layers (CL) and two fully connected (FC) layers.

**Figure 4.** (a) Early fusion: video and audio features are concatenated and used to train an SVR; (b) Late fusion: video and …

**Figure 5.** Summary of the results by adding different elements: CNN alone; CNN with transfer learning and preprocessing …

Original abstract

In this paper we propose a fusion approach to continuous emotion recognition that combines visual and auditory modalities in their representation spaces to predict the arousal and valence levels. The proposed approach employs a pre-trained convolution neural network and transfer learning to extract features from video frames that capture the emotional content. For the auditory content, a minimalistic set of parameters such as prosodic, excitation, vocal tract, and spectral descriptors are used as features. The fusion of these two modalities is carried out at a feature level, before training a single support vector regressor (SVR) or at a prediction level, after training one SVR for each modality. The proposed approach also includes preprocessing and post-processing techniques which contribute favorably to improving the concordance correlation coefficient (CCC). Experimental results for predicting spontaneous and natural emotions on the RECOLA dataset have shown that the proposed approach takes advantage of the complementary information of visual and auditory modalities and provides CCCs of 0.749 and 0.565 for arousal and valence, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard CNN transfer learning plus minimal audio features fused by SVR on RECOLA, but the abstract shows no baselines or ablations so the complementarity claim is untested.

The main takeaway is a straightforward multimodal pipeline on RECOLA that combines pre-trained CNN video features with a small set of prosodic and spectral audio descriptors, fuses them either before or after separate SVRs, and adds preprocessing plus post-processing to reach CCC values of 0.749 for arousal and 0.565 for valence. Nothing in the abstract is new; the pieces are all established techniques applied to a public spontaneous-emotion corpus. The paper frames its own contribution as an incremental combination rather than a new framework or architecture. What it does supply is a concrete, reproducible-sounding recipe with explicit feature lists and two fusion strategies, plus the note that preprocessing helps the scores. That level of detail can be useful as a reference point for others working on the same dataset.

The soft spots are exactly where the stress-test note points: the abstract gives no single-modality baselines, no ablation on the fusion step, no error bars, and no dataset statistics. Without those comparisons it is impossible to know whether the reported CCCs actually come from exploiting complementary information or from the post-processing alone. The assumption that the chosen pre-trained CNN and minimal audio descriptors capture emotion-specific content on spontaneous data therefore remains unverified in what is shown.

This paper is aimed at affective-computing practitioners who need a working multimodal recipe for RECOLA rather than readers seeking new theoretical insight or large gains over prior art. I would send it for peer review so the full methods and any hidden comparisons can be checked; the numbers might turn out to be solid once the missing controls are supplied.

Referee Report

3 major / 2 minor

Summary. The paper proposes a multimodal fusion method for continuous emotion recognition that extracts visual features from video frames via a pre-trained CNN with transfer learning and audio features consisting of a minimalistic set of prosodic, excitation, vocal tract, and spectral descriptors. These modalities are fused either at the feature level (single SVR) or prediction level (separate SVRs per modality), with added preprocessing and post-processing steps. On the RECOLA dataset for spontaneous emotions, the method reports CCC values of 0.749 for arousal and 0.565 for valence, asserting that fusion exploits complementary information between modalities.

Significance. If supported by single-modality baselines and ablations, the work would show that straightforward transfer-learned visual features combined with basic audio descriptors can produce competitive CCC scores on spontaneous affect data via simple SVR fusion; the public RECOLA dataset supports reproducibility. The current manuscript, however, presents only the fused numbers without context for the claimed complementarity or improvement.

major comments (3)

[Abstract] The claim that the approach 'takes advantage of the complementary information of visual and auditory modalities' cannot be evaluated because no single-modality CCC results, ablation studies, or unimodal baselines are reported alongside the fused values of 0.749 (arousal) and 0.565 (valence).
[Abstract] The reported CCC numbers are given without error bars, cross-validation details, statistical tests, or basic dataset statistics (subject count, total duration, train/test split), so the reliability and generalizability of the performance cannot be assessed.
[Abstract / Methods] The pre-trained CNN is described only as capturing 'emotional content' via transfer learning, with no specification of the source model, whether it was fine-tuned on affect data, or any evidence that the extracted features encode emotion-relevant rather than generic visual information.

minor comments (2)

The abstract refers to 'preprocessing and post-processing techniques which contribute favorably' but provides no description of what these steps are or their individual contributions.
Clarify the audio feature extraction pipeline (exact descriptors, dimensionality, normalization) and whether feature selection was performed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested context, details, and clarifications.

Referee: [Abstract] The claim that the approach 'takes advantage of the complementary information of visual and auditory modalities' cannot be evaluated because no single-modality CCC results, ablation studies, or unimodal baselines are reported alongside the fused values of 0.749 (arousal) and 0.565 (valence).

Authors: We agree that unimodal baselines are required to support the complementarity claim. In the revised manuscript we will add CCC results for audio-only and video-only SVR models using identical preprocessing, post-processing, and hyperparameter settings, plus an ablation comparing feature-level versus decision-level fusion against the unimodal cases.
revision: yes

Referee: [Abstract] The reported CCC numbers are given without error bars, cross-validation details, statistical tests, or basic dataset statistics (subject count, total duration, train/test split), so the reliability and generalizability of the performance cannot be assessed.

Authors: The full paper already states the RECOLA corpus contains 16 subjects and follows the standard challenge partition; we will move these statistics into the abstract. We will also specify the cross-validation scheme (subject-independent) and, where multiple runs exist, report standard deviation across folds or seeds. If only a single evaluation was performed we will state this explicitly rather than imply statistical testing.
revision: yes

Referee: [Abstract / Methods] The pre-trained CNN is described only as capturing 'emotional content' via transfer learning, with no specification of the source model, whether it was fine-tuned on affect data, or any evidence that the extracted features encode emotion-relevant rather than generic visual information.

Authors: We will expand both the abstract and methods to name the exact CNN architecture and pre-training corpus (e.g., VGG-Face or ImageNet), state whether any fine-tuning on affective data occurred, and add a short justification, supported by cited prior work, explaining why the chosen layer activations are expected to carry affect-related information.
revision: yes
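The second response promises a subject-independent evaluation with a spread across folds. One way such a protocol could look is sketched below, assuming leave-one-subject-out splitting; this is only one possible subject-independent scheme, not necessarily the partition the authors use. The helper names and the toy subject labels are hypothetical.

```python
from statistics import fmean, stdev


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) once per distinct subject.

    No sample from the held-out subject ever appears in the training indices.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    spread = stdev(fold_scores) if len(fold_scores) > 1 else 0.0
    return fmean(fold_scores), spread


if __name__ == "__main__":
    # One subject label per sample (toy values).
    subjects = ["s1", "s1", "s2", "s3", "s3"]
    for held_out, train, test in leave_one_subject_out(subjects):
        print(held_out, train, test)
    # Per-fold CCCs would come from a model fitted on each training split.
    print(summarize([0.5, 0.7]))
```

Reporting the mean together with the spread across held-out subjects addresses the referee's request for error bars without implying a significance test.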

Circularity Check

0 steps flagged

No circularity: empirical performance metrics on external dataset

The paper's central results are concordance correlation coefficients (0.749 arousal, 0.565 valence) obtained by extracting pre-trained CNN visual features and minimalistic audio descriptors, fusing them, and training SVR regressors on the public RECOLA dataset. No equations, self-definitions, fitted-input renamings, or self-citation chains reduce these reported numbers to the inputs by construction. The derivation consists of standard feature extraction, fusion, and regression steps whose outputs are externally falsifiable on held-out data; the reported CCC values are not tautological with the chosen features or preprocessing. This is a self-contained empirical evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions about feature quality and on the choice of SVR hyperparameters; no new entities are postulated.

free parameters (2)

SVR hyperparameters and fusion weights
Chosen to optimize CCC on the target data; exact values not stated in abstract.
Audio feature subset selection
Which prosodic, excitation, vocal-tract and spectral descriptors are retained is a modeling choice.

axioms (2)

domain assumption: The pre-trained CNN (the Figure 3 caption indicates a network built for the FER dataset) extracts emotion-relevant visual features
Transfer learning is invoked without domain-specific validation in the abstract.
domain assumption: Minimalistic audio descriptors suffice to represent auditory emotional content
The paper states the set is used but does not demonstrate necessity or sufficiency.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "employs a pre-trained convolution neural network and transfer learning to extract features from video frames... minimalistic set of parameters such as prosodic, excitation, vocal tract, and spectral descriptors... fusion... at a feature level... or at a prediction level... SVR"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "provides CCCs of 0.749 and 0.565 for arousal and valence"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

[1] P. Cardinal, N. Dehak, A. L. Koerich, J. Alam, and P. Boucher. ETS System for AV+EC 2015. In 5th Intl Workshop on Audio/Visual Emotion Challenge, pages 17–23, 2015

[2] M. J. Cossetin, J. C. Nievola, and A. L. Koerich. Facial expression recognition using a pairwise feature selection and classification approach. In Intl Joint Conf on Neural Networks, pages 5149–5155, 2016

[3] W. Ding, M. Xu, D. Huang, W. Lin, M. Dong, X. Yu, and H. Li. Audio and face video emotion recognition in the wild using deep neural networks and small datasets. In 18th ACM Intl Conf on Multimodal Interaction, pages 506–513, New York, USA, 2016

[4] C. T. Duong, R. Lebret, and K. Aberer. Multimodal classification for analysing social media. CoRR, abs/1708.02099, 2017

[5] F. Eyben, M. Kaiser, B. Schuller, G. Rigoll, and M. Wöllmer. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Comp, 31(2):153–163, 2012

[6] I. J. Goodfellow, D. Erhan, P. L. Carrier, and A. Courville. Challenges in representation learning: A report on three machine learning contests. Springer-Verlag, 8228:117–124, 2013

[7] Y. Li, J. Tao, B. Schuller, S. Shan, D. Jiang, and J. Jia. Mec 2017: Multimodal emot recog challenge. 1st Asian Conf Affect Comp Intell Interaction, pages 1–5, 2017

[8] P. P. Liang, A. Zadeh, and L.-P. Morency. Multimodal local-global ranking fusion for emot recognition. 20th ACM Intl Conf on Multimodal Interaction, pages 472–476, 2018

[9] S. Mariooryad and C. Busso. Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In 2013 Humaine Assoc Conf Affect Comp Intell Interaction, pages 85–90, 2013

[10] F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-p. P. Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Patt Recog Lett, 66:22–30, 2015

[11] J. A. Russell. A circumplex model of affect. J of Personality and Social Psychology, 39(6):1161–1178, 1989

[12] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In 13th Annual ACM Intl Conf on Multimedia, pages 399–402, New York, USA, 2005

[13] B. Sun, S. Cao, L. Li, J. He, and L. Yu. Exploring multimodal visual features for continuous affect recognition. In 6th Intl Workshop on Audio/Visual Emotion Challenge, pages 83–88, New York, USA, 2016

[14] D. C. Tannugi, A. S. Britto Jr., and A. L. Koerich. Memory integrity of CNNs for cross-dataset facial expression recognition. CoRR, abs/1905.12082, 2019

[15] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, and B. Schuller. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. IEEE Intl Conf Acoust Speech and Signal Proc, pages 5200–5204, 2016

[16] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou. End-to-End Multimodal Emotion Recognition Using Deep Neural Networks. IEEE J of Selec Top in Signal Proc, 11(8):1301–1309, 2017

[17] M. F. Valstar, J. Gratch, B. W. Schuller, F. Ringeval, D. Lalanne, M. Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic. AVEC 2016 workshop and challenge. CoRR, abs/1605.01600, 2016

[18] W. Wan, C. Yang, and Y. Li. Facial expression recognition using convolutional neural network. pages 1–6, 2016

[19] R. Weber, V. Barrielle, C. Soladié, and R. Séguier. High-Level Geometry-based Features of Video Modality for Emotion Prediction. In 6th Intl Workshop on Audio/Visual Emotion Challenge, pages 51–58, New York, USA, 2016

[20] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, and Y. Zong. Multi-cue fusion for emotion recognition in the wild. Neurocomputing, 309:27–35, 2018

[21] K. Yan, W. Zheng, T. Zhang, Y. Zong, and Z. Cui. Cross-database non-frontal facial expression recognition based on transductive deep transfer learning. pages 1–8, 2018

[22] X. Zhu, L. Li, W. Zhang, T. Rao, M. Xu, Q. Huang, and D. Xu. Dependency exploitation: A unified cnn-rnn approach for visual emotion recognition. 26th Intl Joint Conf Artif Intell, pages 3595–3601, 2017
