Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition

Alessandro L. Koerich; Eric Granger; Juan D. S. Ortega; Marco Pedersoli; Mohammed Senoussaoui; Patrick Cardinal

arxiv: 1907.03196 · v1 · pith:QSO34SWDnew · submitted 2019-07-06 · 💻 cs.CV · eess.AS · eess.IV

Juan D. S. Ortega, Mohammed Senoussaoui, Eric Granger, Marco Pedersoli, Patrick Cardinal, Alessandro L. Koerich

Pith reviewed 2026-05-25 01:21 UTC · model grok-4.3

classification 💻 cs.CV · eess.AS · eess.IV

keywords multimodal fusion · emotion recognition · deep neural network · audio video text · concordance correlation coefficient · AVEC dataset · arousal valence liking

The pith

A deep neural network with independent and shared layers fuses audio, video and text to predict emotions with higher concordance than standard early or late fusion methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a DNN architecture that routes each input modality through its own independent layers before merging information in shared layers to form a joint representation for emotion prediction. This structure is evaluated on the AVEC Sentiment Analysis in the Wild dataset and produces higher Concordance Correlation Coefficient scores than systems that simply concatenate features at the input or average scores at the output. The central goal is to demonstrate that learning modality-specific and combined representations together yields better arousal, valence and liking predictions than the two common fusion baselines. If the architecture succeeds, it indicates that explicit separation followed by sharing can improve multimodal emotion recognition without requiring complex hand-designed fusion rules.

Core claim

The DNN with independent layers for audio, video and text plus shared layers for fusion achieves CCC values of 0.606 for arousal, 0.534 for valence and 0.170 for liking on the AVEC development set, exceeding the performance of state-of-the-art early-fusion (feature concatenation) and late-fusion (score-weighted average) systems.
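The scores above are Concordance Correlation Coefficients. As a minimal sketch of how such a score is commonly computed (the function name `concordance_cc` and the toy traces are illustrative, not taken from the paper): CCC divides twice the covariance by the sum of the two variances plus the squared difference of the means, so a prediction that is perfectly correlated with the annotation but offset from it still scores below 1.

```python
def concordance_cc(gold, pred):
    """Concordance Correlation Coefficient (Lin's CCC) of two equal-length sequences."""
    n = len(gold)
    if n == 0 or n != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    # Population (divide-by-n) variances and covariance.
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred)) / n
    # The squared mean difference penalises a systematic bias.
    denom = var_g + var_p + (mean_g - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


gold = [0.1, 0.4, 0.3, 0.8]          # made-up annotation trace
shifted = [g + 0.5 for g in gold]    # perfectly correlated but biased prediction
ccc_same = concordance_cc(gold, gold)
ccc_shifted = concordance_cc(gold, shifted)
```

Here `ccc_same` is 1 and `ccc_shifted` is lower, which is the property that distinguishes CCC from plain Pearson correlation.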

What carries the argument

The DNN architecture consisting of independent layers per modality and shared layers that jointly learn modality-specific representations together with an optimal combined representation.
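The data flow of such an architecture can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layer widths, the tanh activation, merging by concatenation, the single output, the random weights, and every name (`IndependentSharedNet`, `dense`, `apply_dense`) are hypothetical and are not specified in the text above.

```python
import math
import random


def dense(n_in, n_out, rng):
    """Create a randomly initialised fully-connected layer as (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def apply_dense(layer, x, activation=math.tanh):
    """Apply a layer to input vector x; activation=None gives a linear output."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


class IndependentSharedNet:
    """Independent layer per modality, then merge, then shared layers (assumed sizes)."""

    def __init__(self, input_dims, branch_dim=4, shared_dim=6, n_outputs=1, seed=0):
        rng = random.Random(seed)
        # Independent layers: one per modality, no weights shared between them.
        self.branches = {m: dense(d, branch_dim, rng) for m, d in input_dims.items()}
        # Shared layers: operate on the merged representation of all modalities.
        self.shared = dense(branch_dim * len(input_dims), shared_dim, rng)
        self.head = dense(shared_dim, n_outputs, rng)

    def forward(self, inputs):
        merged = []
        for modality, layer in self.branches.items():
            # Merge step, assumed here to be a simple concatenation.
            merged.extend(apply_dense(layer, inputs[modality]))
        hidden = apply_dense(self.shared, merged)
        return apply_dense(self.head, hidden, activation=None)


dims = {"audio": 5, "video": 7, "text": 3}   # hypothetical feature sizes
sample = {"audio": [0.1] * 5, "video": [0.2] * 7, "text": [0.3] * 3}
net = IndependentSharedNet(dims)
pred = net.forward(sample)                   # one untrained prediction
```

Because every branch feeds the same shared layers, a gradient-based trainer would update the modality-specific and joint representations together, which is the property the section attributes to the design.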

If this is right

The architecture can be directly substituted into other multimodal emotion or sentiment pipelines that currently use concatenation or score averaging.
Performance on arousal, valence and liking can be expected to improve when the same layer separation is applied to new audio-video-text datasets of similar size.
Training time and model size remain comparable to baseline DNNs while delivering measurable gains in correlation metrics.
The method supports end-to-end training, removing the need for separate modality-specific classifiers before fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same independent-plus-shared pattern may transfer to other multimodal tasks such as action recognition or speaker verification where cross-modal interactions matter.
If the shared layers capture interactions that simple fusion misses, adding more modalities should produce further gains without redesigning the fusion step.
Results on the development partition alone leave open whether the advantage holds on the hidden test partition or on entirely different corpora.

Load-bearing premise

The independent and shared layers learn a superior combined representation compared with standard early or late fusion on this dataset.

What would settle it

Another fusion approach achieving strictly higher CCC scores on the same AVEC development partition without the proposed independent-plus-shared layer structure would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 1907.03196 by Alessandro L. Koerich, Eric Granger, Juan D. S. Ortega, Marco Pedersoli, Mohammed Senoussaoui, Patrick Cardinal.

**Figure 1.** Proposed DNN architecture for multimodal fusion: (a) independent layers; (b) merge layer; (c) fully-connected layer.

**Figure 2.** CCC as a function of the delay. The delay compensation, optimized on the development partition, was set to d = 1.5 for arousal and valence, where the curves show that the CCC for those dimensions increases. Liking predictions, by contrast, behave better at d = 2.5.

**Figure 3.** Relative importance of each modality in a late fusion.
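The delay compensation mentioned in the Figure 2 caption can be illustrated with a short sketch. It assumes the common convention of shifting the annotation trace back in time and repeating the last label to keep the length. The helper names, the toy data, and the stand-in scorer are hypothetical, and the delay is expressed in frames because the caption does not give the unit of d or the label rate.

```python
def compensate_delay(labels, delay_frames):
    """Shift labels back in time by delay_frames, padding the end with the last label."""
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")
    labels = list(labels)
    if not labels or delay_frames == 0:
        return labels
    d = min(delay_frames, len(labels))
    return labels[d:] + [labels[-1]] * d


def pick_delay(predictions, labels, candidates, score_fn):
    """Return the candidate delay whose shifted labels agree best with predictions."""
    return max(candidates, key=lambda d: score_fn(compensate_delay(labels, d), predictions))


def neg_squared_error(gold, pred):
    """Stand-in agreement score; a CCC function could be passed instead."""
    return -sum((g - p) ** 2 for g, p in zip(gold, pred))


predictions = [0, 1, 2, 3, 4, 5]   # toy model output
labels = [0, 0, 0, 1, 2, 3]        # toy annotation lagging two frames behind
best = pick_delay(predictions, labels, candidates=[0, 1, 2, 3], score_fn=neg_squared_error)
```

On this toy trace the search selects a delay of two frames, matching the lag built into `labels`. Choosing the delay on the development partition, as the caption describes, means the selected value is itself tuned to that partition.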

Original abstract

This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers which aim to learn the representation for each modality, as well as the best combined representation to achieve the best prediction. Experimental results on the AVEC Sentiment Analysis in the Wild dataset indicate that the proposed DNN can achieve a higher level of Concordance Correlation Coefficient (CCC) than other state-of-the-art systems that perform early fusion of modalities at feature-level (i.e., concatenation) and late fusion at score-level (i.e., weighted average) fusion. The proposed DNN has achieved CCCs of 0.606, 0.534, and 0.170 on the development partition of the dataset for predicting arousal, valence and liking, respectively.
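For contrast, the two baseline strategies named in the abstract can be sketched as follows. This is a hedged illustration only: the function names, the fixed alphabetical modality order, and the example weights are assumptions, not details from the paper.

```python
def early_fusion(features_by_modality):
    """Feature-level fusion: concatenate per-modality feature vectors into one vector."""
    fused = []
    for modality in sorted(features_by_modality):   # fixed order keeps columns aligned
        fused.extend(features_by_modality[modality])
    return fused


def late_fusion(scores_by_modality, weights):
    """Score-level fusion: weighted average of per-modality predictions."""
    total = sum(weights[m] for m in scores_by_modality)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[m] * s for m, s in scores_by_modality.items()) / total


features = {"audio": [1.0, 2.0], "video": [3.0], "text": [4.0, 5.0]}   # toy features
fused_vector = early_fusion(features)

scores = {"audio": 0.6, "video": 0.2, "text": 0.4}    # toy per-modality predictions
weights = {"audio": 2.0, "video": 1.0, "text": 1.0}   # hypothetical modality weights
fused_score = late_fusion(scores, weights)
```

In early fusion a single model sees the concatenated vector, while in late fusion each modality has its own model and only the scores are combined. The proposed network sits between the two by keeping separate layers first and joining them inside one trainable model.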

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports a DNN with independent modality layers plus shared fusion layers that beats some early/late fusion baselines on AVEC CCC, but the comparisons leave open whether the gain is from the architecture or from stronger unimodal features.

The main thing here is a DNN architecture that keeps separate layers for audio, video, and text before feeding into shared fusion layers, and it posts CCC numbers of 0.606 arousal, 0.534 valence, and 0.170 liking on the AVEC development partition. Those numbers sit above the early-fusion and late-fusion baselines the abstract cites. That is the concrete result the paper offers. The design itself is the new piece; the individual components are standard DNN blocks, but the split between independent and shared layers is presented as the distinguishing choice. The work is straightforward about running on a public benchmark and giving the three CCC values, which is useful for anyone tracking incremental progress in multimodal affect recognition.

The soft spot is the baseline comparison. The abstract claims superiority over other state-of-the-art early and late fusion systems, yet gives no sign that those systems were re-run with identical front-end extractors, optimizers, or hyper-parameters. If the numbers are taken from the original papers, any improvement could trace to better unimodal encoders rather than the fusion layers. That is a real gap for attributing the result to the proposed architecture. No architecture diagram, training schedule, or significance test appears in the abstract, so the support for the central claim stays limited to the reported point estimates.

The paper does not contain equations that reduce to fitted parameters by construction, so there is no circularity issue. It reads as an honest experimental report rather than an over-claim.

This is for researchers who follow affective computing and multimodal fusion on AVEC-style data; they can extract the architecture description and the three numbers for their own comparisons. It is incremental rather than foundational, but the experimental claim is clear enough that a serious editor should send it to referees so the baseline controls and implementation details can be checked.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a deep neural network architecture featuring independent and shared layers for multimodal fusion of audio, video, and text modalities aimed at emotion recognition. On the AVEC Sentiment Analysis in the Wild dataset, the proposed model reports Concordance Correlation Coefficient (CCC) scores of 0.606 for arousal, 0.534 for valence, and 0.170 for liking on the development partition, claiming superiority over state-of-the-art early fusion (feature concatenation) and late fusion (score-level weighted average) approaches.

Significance. If the experimental comparisons hold under controlled conditions with identical unimodal encoders and training procedures, the work would demonstrate the value of hybrid independent-shared layer designs in learning superior joint representations for continuous affect prediction, contributing to multimodal machine learning in affective computing.

major comments (2)

[Abstract] The central claim that the proposed DNN outperforms other SOTA systems performing early and late fusion rests on reported CCC values, but the abstract provides no indication that the baseline systems were re-implemented with the same audio/video/text front-ends, optimizers, or hyperparameters on the AVEC development partition. This undermines attribution of gains to the fusion architecture rather than stronger unimodal encoders.
[Experimental Results] No architecture diagram, training details, statistical significance tests, or full comparison tables are referenced, making it impossible to verify the support for the claim that independent and shared layers learn a superior combined representation compared with standard early or late fusion strategies.

minor comments (2)

[Abstract] The title refers to 'Audio-Video Emotion Recognition' while the abstract and claims include text modality; this inconsistency should be clarified.
The manuscript lacks details on the structure of independent and shared layers (e.g., layer counts, dimensions, or activation functions).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability of the experimental claims.

Referee: [Abstract] The central claim that the proposed DNN outperforms other SOTA systems performing early and late fusion rests on reported CCC values, but the abstract provides no indication that the baseline systems were re-implemented with the same audio/video/text front-ends, optimizers, or hyperparameters on the AVEC development partition. This undermines attribution of gains to the fusion architecture rather than stronger unimodal encoders.

Authors: We agree that the abstract should explicitly address this point to avoid ambiguity. The early and late fusion baselines were re-implemented in our experiments using the same unimodal front-ends for audio, video, and text, along with identical optimizers and hyperparameters on the AVEC development partition. We will revise the abstract to state this clearly.

revision: yes

Referee: [Experimental Results] No architecture diagram, training details, statistical significance tests, or full comparison tables are referenced, making it impossible to verify the support for the claim that independent and shared layers learn a superior combined representation compared with standard early or late fusion strategies.

Authors: The referee is correct that these elements are needed for full verification. The manuscript describes the architecture and provides some training details in the Experimental Results section, but we will add an explicit diagram, expanded training procedures, statistical significance tests, and more complete comparison tables in the revision to better support the claims.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture comparison on external benchmark

The paper proposes a DNN architecture with independent and shared layers for audio-video-text fusion and reports measured CCC values (0.606/0.534/0.170) on the AVEC development partition. No equations, derivations, or fitted parameters are presented that reduce to the inputs by construction. The central claim rests on experimental comparison against numbers reported in prior literature for early/late fusion systems; those external results are falsifiable outside this paper and do not constitute a self-citation chain or self-definitional loop. The architecture choice is an ansatz validated by held-out performance rather than forced by definition or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.


