arXiv: 1906.10623 · v1 · pith:W3RA2TFKnew · submitted 2019-06-25 · 💻 cs.LG · cs.SD · eess.AS · stat.ML

Emotion Recognition Using Fusion of Audio and Video Features

Pith reviewed 2026-05-25 16:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.SD · eess.AS · stat.ML
keywords emotion recognition · multimodal fusion · audio features · video features · arousal valence · RECOLA dataset · support vector regression · continuous emotion

The pith

Combining pre-trained video features with audio descriptors via fusion improves prediction of arousal and valence in spontaneous emotions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes fusing visual features extracted by a pre-trained CNN with a minimalistic set of audio parameters (prosodic, excitation, vocal tract, and spectral descriptors) to recognize continuous emotions. The fusion happens either at the feature level, before training a single support vector regressor (SVR), or at the prediction level, after training one SVR per modality. On the RECOLA dataset of spontaneous emotions, the method reports concordance correlation coefficients (CCCs) of 0.749 for arousal and 0.565 for valence. A sympathetic reader would care because accurate emotion recognition from audio and video could enable better human-computer interaction and mental health monitoring. The approach highlights how complementary information from the two modalities can improve on either modality alone.

Core claim

Fusing visual features from a pre-trained convolutional neural network with minimalistic auditory descriptors, at either the feature or the prediction level and with support vector regression, exploits complementary information and, after preprocessing and post-processing, achieves CCCs of 0.749 for arousal and 0.565 for valence on the RECOLA dataset.
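
For readers unfamiliar with the metric, the CCC values quoted here can be read against Lin's standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below is a minimal pure-Python version using population statistics. The function name `concordance_cc` is illustrative, and how the paper itself computes or aggregates CCC across recordings is not stated on this page.

```python
from statistics import fmean


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = fmean(y_true)
    mean_p = fmean(y_pred)

    # Population (divide-by-n) variances and covariance.
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))

    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        # Both sequences are the same constant: the ratio is undefined.
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC penalizes offsets: `concordance_cc([1, 2, 3], [2, 3, 4])` evaluates to 4/7 even though the two sequences are perfectly correlated.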

What carries the argument

Feature-level or prediction-level fusion of CNN-extracted video features and prosodic/excitation/vocal tract/spectral audio descriptors fed into an SVR.

If this is right

  • Visual and auditory modalities provide complementary information that improves emotion prediction.
  • The fusion approach yields a higher CCC for arousal (0.749) than for valence (0.565).
  • Preprocessing and post-processing techniques contribute to better concordance correlation coefficients.
  • Single SVR after fusion or separate SVRs per modality both work for the task.
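
The two fusion points named in the bullets above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names are hypothetical, the regressor itself is left out, and the weighted average used for prediction-level fusion is an assumption, since this page does not say how the per-modality SVR outputs are combined.

```python
def early_fusion(video_feats, audio_feats):
    """Feature-level fusion: concatenate aligned per-frame feature vectors.

    The fused vectors would then be used to train a single regressor.
    """
    if len(video_feats) != len(audio_feats):
        raise ValueError("modalities must be aligned frame by frame")
    return [list(v) + list(a) for v, a in zip(video_feats, audio_feats)]


def late_fusion(video_preds, audio_preds, video_weight=0.5):
    """Prediction-level fusion: combine per-modality predictions.

    Assumption: a convex combination with weight `video_weight` on video.
    """
    if len(video_preds) != len(audio_preds):
        raise ValueError("prediction sequences must have equal length")
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    audio_weight = 1.0 - video_weight
    return [video_weight * v + audio_weight * a
            for v, a in zip(video_preds, audio_preds)]
```

Under these assumptions, early fusion changes the input dimensionality of one model, while late fusion leaves each model untouched and adds only the combination weight as an extra parameter to tune.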

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fusion might improve results on other emotion datasets with different recording conditions.
  • Real-time applications could benefit if the method is optimized for speed.
  • The choice of pre-trained CNN suggests transfer learning is key for visual emotion cues.

Load-bearing premise

The chosen pre-trained CNN features and minimalistic audio descriptors capture emotional content such that their fusion improves over single-modality baselines on spontaneous data.

What would settle it

A direct comparison showing that either video-only or audio-only features achieve equal or higher CCC than the fused model on the RECOLA dataset would falsify the benefit of fusion.

Figures

Figures reproduced from arXiv: 1906.10623 by Alessandro L. Koerich, Juan D. S. Ortega, Patrick Cardinal.

Figure 1: (a) Pre-training a CNN in a source dataset; (b) Fine-tuning the CNN on the target dataset.
Figure 2: Using the fine-tuned CNN to extract features and train …
Figure 3: The CNN architecture for the FER dataset with three convolutional layers (CL) and two fully connected (FC) layers.
Figure 4: (a) Early fusion: video and audio features are concatenated and used to train an SVR; (b) Late fusion: video and …
Figure 5: Summary of the results by adding different elements: CNN alone; CNN with transfer learning and preprocessing …
Original abstract

In this paper we propose a fusion approach to continuous emotion recognition that combines visual and auditory modalities in their representation spaces to predict the arousal and valence levels. The proposed approach employs a pre-trained convolution neural network and transfer learning to extract features from video frames that capture the emotional content. For the auditory content, a minimalistic set of parameters such as prosodic, excitation, vocal tract, and spectral descriptors are used as features. The fusion of these two modalities is carried out at a feature level, before training a single support vector regressor (SVR) or at a prediction level, after training one SVR for each modality. The proposed approach also includes preprocessing and post-processing techniques which contribute favorably to improving the concordance correlation coefficient (CCC). Experimental results for predicting spontaneous and natural emotions on the RECOLA dataset have shown that the proposed approach takes advantage of the complementary information of visual and auditory modalities and provides CCCs of 0.749 and 0.565 for arousal and valence, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a multimodal fusion method for continuous emotion recognition that extracts visual features from video frames via a pre-trained CNN with transfer learning and audio features consisting of a minimalistic set of prosodic, excitation, vocal tract, and spectral descriptors. These modalities are fused either at the feature level (single SVR) or prediction level (separate SVRs per modality), with added preprocessing and post-processing steps. On the RECOLA dataset for spontaneous emotions, the method reports CCC values of 0.749 for arousal and 0.565 for valence, asserting that fusion exploits complementary information between modalities.

Significance. If supported by single-modality baselines and ablations, the work would show that straightforward transfer-learned visual features combined with basic audio descriptors can produce competitive CCC scores on spontaneous affect data via simple SVR fusion; the public RECOLA dataset supports reproducibility. The current manuscript, however, presents only the fused numbers without context for the claimed complementarity or improvement.

major comments (3)
  1. [Abstract] The claim that the approach 'takes advantage of the complementary information of visual and auditory modalities' cannot be evaluated because no single-modality CCC results, ablation studies, or unimodal baselines are reported alongside the fused values of 0.749 (arousal) and 0.565 (valence).
  2. [Abstract] The reported CCC numbers are given without error bars, cross-validation details, statistical tests, or basic dataset statistics (subject count, total duration, train/test split), so the reliability and generalizability of the performance cannot be assessed.
  3. [Abstract / Methods] The pre-trained CNN is described only as capturing 'emotional content' via transfer learning, with no specification of the source model, whether it was fine-tuned on affect data, or any evidence that the extracted features encode emotion-relevant rather than generic visual information.
minor comments (2)
  1. The abstract refers to 'preprocessing and post-processing techniques which contribute favorably' but provides no description of what these steps are or their individual contributions.
  2. Clarify the audio feature extraction pipeline (exact descriptors, dimensionality, normalization) and whether feature selection was performed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested context, details, and clarifications.

  1. Referee: [Abstract] The claim that the approach 'takes advantage of the complementary information of visual and auditory modalities' cannot be evaluated because no single-modality CCC results, ablation studies, or unimodal baselines are reported alongside the fused values of 0.749 (arousal) and 0.565 (valence).

    Authors: We agree that unimodal baselines are required to support the complementarity claim. In the revised manuscript we will add CCC results for audio-only and video-only SVR models using identical preprocessing, post-processing, and hyperparameter settings, plus an ablation comparing feature-level versus decision-level fusion against the unimodal cases. revision: yes

  2. Referee: [Abstract] The reported CCC numbers are given without error bars, cross-validation details, statistical tests, or basic dataset statistics (subject count, total duration, train/test split), so the reliability and generalizability of the performance cannot be assessed.

    Authors: The full paper already states the RECOLA corpus contains 16 subjects and follows the standard challenge partition; we will move these statistics into the abstract. We will also specify the cross-validation scheme (subject-independent) and, where multiple runs exist, report standard deviation across folds or seeds. If only a single evaluation was performed we will state this explicitly rather than imply statistical testing. revision: yes

  3. Referee: [Abstract / Methods] The pre-trained CNN is described only as capturing 'emotional content' via transfer learning, with no specification of the source model, whether it was fine-tuned on affect data, or any evidence that the extracted features encode emotion-relevant rather than generic visual information.

    Authors: We will expand both the abstract and methods to name the exact CNN architecture and pre-training corpus (e.g., VGG-Face or ImageNet), state whether any fine-tuning on affective data occurred, and add a short justification, supported by cited prior work, explaining why the chosen layer activations are expected to carry affect-related information. revision: yes
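
The subject-independent evaluation mentioned in response 2 can be illustrated with a leave-one-subject-out split, in which no subject contributes data to both training and test sets. This is only a sketch of one such scheme: the function name is hypothetical, and the rebuttal itself refers to the standard challenge partition rather than to this exact procedure.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    `subject_ids[i]` is the subject label of sample i. Every sample of the
    held-out subject goes to the test set, so the splits are subject-independent.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

Reporting the mean and standard deviation of the CCC over such folds would give the spread the referee asks for, provided more than one fold is actually evaluated.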

Circularity Check

0 steps flagged

No circularity: empirical performance metrics on external dataset

The paper's central results are concordance correlation coefficients (0.749 arousal, 0.565 valence) obtained by extracting pre-trained CNN visual features and minimalistic audio descriptors, fusing them, and training SVR regressors on the public RECOLA dataset. No equations, self-definitions, fitted-input renamings, or self-citation chains reduce these reported numbers to the inputs by construction. The derivation consists of standard feature extraction, fusion, and regression steps whose outputs are externally falsifiable on held-out data; the reported CCC values are not tautological with the chosen features or preprocessing. This is a self-contained empirical evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions about feature quality and on the choice of SVR hyperparameters; no new entities are postulated.

free parameters (2)
  • SVR hyperparameters and fusion weights
    Chosen to optimize CCC on the target data; exact values not stated in abstract.
  • Audio feature subset selection
    Which prosodic, excitation, vocal-tract and spectral descriptors are retained is a modeling choice.
axioms (2)
  • domain assumption: A CNN pre-trained on a source dataset (the FER dataset, per the Figure 3 caption) extracts emotion-relevant visual features
    Transfer learning is invoked without domain-specific validation in the abstract.
  • domain assumption: Minimalistic audio descriptors suffice to represent auditory emotional content
    The paper states the set is used but does not demonstrate necessity or sufficiency.

pith-pipeline@v0.9.0 · 5712 in / 1266 out tokens · 33809 ms · 2026-05-25T16:23:45.123729+00:00 · methodology
