arxiv: 1907.10428 · v1 · pith:7OYYHJVPnew · submitted 2019-07-23 · 💻 cs.LG · cs.HC · cs.SD · eess.AS

EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings

Pith reviewed 2026-05-24 17:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.HC · cs.SD · eess.AS
keywords emotion recognition · crossmodal embeddings · multimodal training · monomodal inference · dimensional regression · categorical classification · shared embedding space

The pith

Crossmodal emotion embeddings strengthen monomodal recognition by transferring semantic information from auxiliary modalities during training only.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EmoBed to improve emotion recognition from a single modality by drawing on knowledge from other modalities during training. It combines joint multimodal training that shares a recognition network with crossmodal training that shares an emotion embedding space. Both steps extract underlying semantic emotion information so the final model can exploit complementary cues without needing the extra modalities at inference time. Experiments on RECOLA for dimensional regression and OMG-Emotion for categorical classification show gains over monomodal baselines and results competitive with recent systems.

Core claim

EmoBed improves monomodal emotion recognition performance by training with joint multimodal learning on a shared network and crossmodal learning in a shared embedding space, allowing the system to use complementary information from auxiliary modalities without requiring their presence during inference.

What carries the argument

EmoBed framework consisting of joint multimodal training with a shared recognition network and crossmodal training with a shared emotion embedding space.
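As a rough illustration of how the two components could enter a training objective, the sketch below combines a task loss with a weighted extra term. The weights α and β are named in the captions of Figures 6 and 7; everything else is an assumption made for illustration, including the function names, the margin, the squared Euclidean distance, and the use of the standard triplet margin form. It is not the authors' implementation.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Standard triplet margin loss (assumed form, not taken from the paper).

    anchor:   embedding from the main modality
    positive: embedding from the auxiliary modality with the same emotion label
    negative: embedding from the auxiliary modality with a different label
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


def weighted_objective(task_loss: float, extra_loss: float, weight: float) -> float:
    """Task loss plus a weighted extra term.

    With weight = alpha the extra term stands for an auxiliary-modality loss
    (joint training); with weight = beta it stands for the triplet loss
    (crossmodal training). Both pairings are hypothetical.
    """
    return task_loss + weight * extra_loss


# Hypothetical two-dimensional embeddings, for illustration only.
audio_sad = [0.0, 0.0]
video_sad = [1.0, 0.0]
video_neutral = [0.0, 1.0]
beta = 0.5
total = weighted_objective(0.2, triplet_loss(audio_sad, video_sad, video_neutral), beta)
```

In this form the triplet term is zero once the same-emotion pair is closer than the different-emotion pair by at least the margin, so a weight such as β only matters while the two modalities are still misaligned in the shared space.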

If this is right

  • Monomodal systems trained this way outperform related single-modality baselines on both regression and classification tasks.
  • The resulting monomodal performance is competitive with, or superior to, that of recently reported systems.
  • The framework works for both dimensional emotion regression and categorical emotion classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Emotion representations may contain a modality-independent semantic core that can be extracted through shared training.
  • The same training pattern could be tested on paired modalities beyond audio and video, such as text-visual data.
  • Deployment of emotion systems could rely on cheaper single-modality hardware while still benefiting from richer training data.

Load-bearing premise

Semantic emotion information learned from multiple modalities in a shared network or embedding space transfers to improve accuracy in a monomodal network when the auxiliary modalities are absent at test time.

What would settle it

Running the EmoBed training procedure on RECOLA or OMG-Emotion and measuring no gain or a loss in monomodal regression or classification accuracy compared with standard single-modality training.
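Such a comparison needs a fixed score for the regression task. This page does not name the metric, so the sketch below assumes the concordance correlation coefficient (CCC), a common choice for continuous arousal and valence prediction; the function name and the toy numbers are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def concordance_cc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Concordance correlation coefficient between predictions and labels.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2),
    using population (1/N) variance and covariance.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denom


# Toy labels and predictions, invented for illustration only.
gold = [0.1, 0.4, 0.3, 0.8]
baseline_pred = [0.3, 0.3, 0.4, 0.5]
emobed_pred = [0.2, 0.4, 0.3, 0.7]
gain = concordance_cc(emobed_pred, gold) - concordance_cc(baseline_pred, gold)
```

Under this assumption, the test described above amounts to checking whether `gain` is zero or negative on the real development or test partitions, with both systems using the same single input modality.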

Figures

Figures reproduced from arXiv: 1907.10428 by Björn Schuller, Jing Han, Zhao Ren, Zixing Zhang.

Figure 1
The proposed crossmodal Emotion emBedding (EmoBed) framework for monomodal emotion recognition.
Figure 2
Structure comparison among the proposed joint audiovisual training (e), and other related multimodal learning frameworks (i.e., early fusion, …).
Figure 4
Visualisation of the learnt representations of the development set of the OMG-Emotion database when using the proposed EmoBed systems or the classic monomodal systems. Red and green markers: representations from audio and video modalities; circle and cross markers: neutral and sad categories.
Figure 3
Visualisation of the learnt representations of the development set of the RECOLA database when using the proposed EmoBed systems or the classic monomodal systems. Red, green, and yellow markers: representations from audio (eGeMAPS), video (appearance), and video (geometric) modalities; circle and cross markers: high and low arousal/valence.
Figure 7
Impact of the joint auxiliary modality loss on the joint audiovisual training systems (a), and impact of the crossmodal triplet loss on the crossmodal triplet training systems (b), with the OMG-Emotion database. The best-performing α or β is indicated in each case.
Figure 6
Impact of the joint auxiliary modality loss on the joint audiovisual training systems (a), and impact of the crossmodal triplet loss on the crossmodal triplet training systems (b), with the RECOLA database for valence regression. The best-performing α or β is indicated in each case.
Original abstract

Despite remarkable advances in emotion recognition, they are severely restrained from either the essentially limited property of the employed single modality, or the synchronous presence of all involved multiple modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowledge from other auxiliary modalities to improve the performance of an emotion recognition system at hand. The framework generally includes two main learning components, i. e., joint multimodal training and crossmodal training. Both of them tend to explore the underlying semantic emotion information but with a shared recognition network or with a shared emotion embedding space, respectively. In doing this, the enhanced system trained with this approach can efficiently make use of the complementary information from other modalities. Nevertheless, the presence of these auxiliary modalities is not demanded during inference. To empirically investigate the effectiveness and robustness of the proposed framework, we perform extensive experiments on the two benchmark databases RECOLA and OMG-Emotion for the tasks of dimensional emotion regression and categorical emotion classification, respectively. The obtained results show that the proposed framework significantly outperforms related baselines in monomodal inference, and are also competitive or superior to the recently reported systems, which emphasises the importance of the proposed crossmodal learning for emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes EmoBed, a crossmodal emotion embedding framework with two components: joint multimodal training, which shares a recognition network, and crossmodal training, which aligns modalities in a shared emotion embedding space. Together they transfer semantic emotion information from auxiliary modalities into a monomodal inference network. Experiments on RECOLA (dimensional regression) and OMG-Emotion (categorical classification) report that the resulting monomodal system outperforms standard monomodal baselines and is competitive with, or superior to, recently reported systems, while requiring only a single modality at test time.

Significance. If the reported gains hold under rigorous statistical evaluation, the work is significant for practical emotion recognition because it demonstrates a training-time mechanism to inject complementary multimodal information into single-modality models without incurring auxiliary-modality costs at inference. This addresses a common deployment constraint and could generalize to other multimodal perception tasks.

minor comments (2)
  1. Abstract, final sentence: subject-verb agreement error ('the proposed framework ... are also competitive').
  2. The abstract asserts that the framework 'significantly outperforms' related baselines and is 'competitive or superior' to recent systems, but supplies no numerical deltas, error bars, or statistical tests; the full manuscript should make these explicit in the experimental section for reproducibility.
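
One way to report the deltas and error bars asked for in comment 2 is a paired bootstrap over per-recording scores. The sketch below is a generic procedure, not something taken from the paper: the function name, the resample count, and the assumption that a score is available per recording for both systems are all hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def paired_bootstrap_ci(baseline: Sequence[float], candidate: Sequence[float],
                        n_resamples: int = 2000, level: float = 0.95,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean paired difference (candidate - baseline) with a percentile interval.

    Both sequences must hold one score per recording, in the same order.
    Returns (mean_delta, lower_bound, upper_bound).
    """
    if len(baseline) != len(candidate) or len(baseline) == 0:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample recordings with replacement and average the paired differences.
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * (n_resamples - 1))]
    upper = means[int((1.0 - tail) * (n_resamples - 1))]
    return fmean(diffs), lower, upper
```

If the resulting interval excludes zero, the reported gain is unlikely to be an artefact of which recordings happened to be in the evaluation set. For a score such as CCC that is not an average over recordings, the metric would instead have to be recomputed on each resample.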

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We are pleased that the significance of the EmoBed framework for practical single-modality emotion recognition is acknowledged.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes a training framework (joint multimodal training plus crossmodal embedding alignment) whose auxiliary modalities are used exclusively at training time; the monomodal inference network is evaluated under an explicit single-modality test constraint on RECOLA and OMG-Emotion. No equations, fitted parameters, or self-citations are shown that would reduce the reported performance gains to a definitional identity or to a renamed input. The central claim therefore rests on an externally verifiable empirical comparison rather than on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5767 in / 1043 out tokens · 39177 ms · 2026-05-24T17:49:36.136825+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. [1]

    Emotion recognition in human-computer interaction,

    R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, Jan. 2001

  2. [2]

    A survey of affect recognition methods: Audio, visual, and spontaneous expressions,

    Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, Jan. 2009

  3. [3]

    Distributing recognition in computational paralinguistics,

    Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Distributing recognition in computational paralinguistics,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 406–417, Oct. 2014

  4. [4]

    Affective computing and sentiment analysis,

    E. Cambria, “Affective computing and sentiment analysis,” IEEE Intelligent Systems, vol. 31, no. 2, pp. 102–107, Mar. 2016

  5. [5]

    Survey on speech emotion recognition: Features, classification schemes, and databases,

    M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, Mar. 2011

  6. [6]

    Automatic facial expression recognition using features of salient facial patches,

    S. L. Happy and A. Routray, “Automatic facial expression recognition using features of salient facial patches,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 1–12, Jan. 2015

  7. [7]

    Sentiment analysis: From opinion mining to human-agent interaction,

    C. Clavel and Z. Callejas, “Sentiment analysis: From opinion mining to human-agent interaction,” IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 74–93, Jan. 2016

  8. [8]

    Emotions recognition using EEG signals: A survey,

    S. M. Alarcao and M. J. Fonseca, “Emotions recognition using EEG signals: A survey,” IEEE Transactions on Affective Computing, June 2018, 20 pages

  9. [9]

    Cross-corpus acoustic emotion recognition with multi-task learning: Seeking common ground while preserving differences,

    B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition with multi-task learning: Seeking common ground while preserving differences,” IEEE Transactions on Affective Computing, Mar. 2017, 14 pages

  10. [10]

    Adversarial training in affective computing and sentiment analysis: Recent advances and perspectives,

    J. Han, Z. Zhang, N. Cummins, and B. Schuller, “Adversarial training in affective computing and sentiment analysis: Recent advances and perspectives,” IEEE Computational Intelligence Magazine, 2018, 13 pages

  11. [11]

    Domain adversarial for acoustic emotion recognition,

    M. Abdelwahab and C. Busso, “Domain adversarial for acoustic emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423–2435, Dec. 2018

  12. [12]

    Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,

    M. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing, vol. 2, no. 2, pp. 92–105, Apr. 2011

  13. [13]

    Fusing audio, visual and textual clues for sentiment analysis from multimodal content,

    S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, “Fusing audio, visual and textual clues for sentiment analysis from multimodal content,” Neurocomputing, vol. 174, pp. 50–59, Jan. 2016

  14. [14]

    Strength modelling for real-world automatic continuous affect recognition from audiovisual signals,

    J. Han, Z. Zhang, N. Cummins, F. Ringeval, and B. Schuller, “Strength modelling for real-world automatic continuous affect recognition from audiovisual signals,” Image and Vision Computing, vol. 65, pp. 76–86, Sep. 2017

  15. [15]

    Dynamic difficulty awareness training for continuous emotion prediction,

    Z. Zhang, J. Han, and B. Schuller, “Dynamic difficulty awareness training for continuous emotion prediction,” IEEE Transactions on Multimedia, vol. PP, Sep. 2018, 13 pages

  16. [16]

    Multimodal depression detection: Fusion analysis of paralinguistic, head pose and eye gaze behaviors,

    S. Alghowinem, R. Goecke, M. Wagner, J. Epps, M. Hyett, G. Parker, and M. Breakspear, “Multimodal depression detection: Fusion analysis of paralinguistic, head pose and eye gaze behaviors,” IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 478–490, Oct. 2018

  17. [17]

    Large scale online learning of image similarity through ranking,

    G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” Journal of Machine Learning Research, vol. 11, no. Mar., pp. 1109–1135, 2010

  18. [18]

    FaceNet: A unified embedding for face recognition and clustering,

    F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 815–823

  19. [19]

    A multitask approach to continuous five-dimensional affect sensing in natural speech,

    F. Eyben, M. Wöllmer, and B. Schuller, “A multitask approach to continuous five-dimensional affect sensing in natural speech,” ACM Transactions on Interactive Intelligent Systems, vol. 2, no. 1, pp. 1–29, Mar. 2012

  20. [20]

    A multi-task learning framework for emotion recognition using 2D continuous space,

    R. Xia and Y. Liu, “A multi-task learning framework for emotion recognition using 2D continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, Jan. 2017

  21. [21]

    From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty,

    J. Han, Z. Zhang, M. Schmitt, and B. Schuller, “From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty,” in Proc. ACM International Conference on Multimedia (MM), Mountain View, CA, 2017, pp. 890–897

  22. [22]

    Multi-modal multi-cultural dimensional continues emotion recognition in dyadic interactions,

    J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural dimensional continues emotion recognition in dyadic interactions,” in Proc. 8th International Workshop on Audio/Visual Emotion Challenge (AVEC), Seoul, South Korea, 2018, pp. 65–72

  23. [23]

    Personalized multitask learning for predicting tomorrow’s mood, stress, and health,

    S. A. Taylor, N. Jaques, E. Nosakhare, A. Sano, and R. Picard, “Personalized multitask learning for predicting tomorrow’s mood, stress, and health,” IEEE Transactions on Affective Computing, Dec. 2018, 14 pages

  24. [24]

    Emotion recognition in speech with latent discriminative representations learning,

    J. Han, Z. Zhang, G. Keren, and B. Schuller, “Emotion recognition in speech with latent discriminative representations learning,” Acta Acustica united with Acustica, vol. 104, no. 5, pp. 737–740, Sep. 2018

  25. [25]

    Speech emotion recognition from variable-length inputs with triplet loss function,

    J. Huang, Y. Li, J. Tao, and Z. Lian, “Speech emotion recognition from variable-length inputs with triplet loss function,” in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, 2018, pp. 3673–3677

  26. [26]

    Cross-domain feature learning in multimedia,

    X. Yang, T. Zhang, and C. Xu, “Cross-domain feature learning in multimedia,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 64–78, Jan. 2015

  27. [27]

    Learning consistent feature representation for cross-modal multimedia retrieval,

    C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, Mar. 2015

  28. [28]

    SoundNet: Learning sound representations from unlabeled video,

    Y. Aytar, C. Vondrick, and A. Torralba, “SoundNet: Learning sound representations from unlabeled video,” in Proc. Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016, pp. 892–900

  29. [29]

    Adversarial cross-modal retrieval,

    B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proc. ACM International Conference on Multimedia (MM), Mountain View, CA, 2017, pp. 154–162

  30. [30]

    Emotion recognition in speech using cross-modal transfer in the wild,

    S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proc. ACM International Conference on Multimedia (MM), Seoul, Korea, 2018, pp. 292–301

  31. [31]

    Context-sensitive learning for enhanced audiovisual emotion classification,

    A. Metallinou, M. Wöllmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan, “Context-sensitive learning for enhanced audiovisual emotion classification,” IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 184–198, Jan. 2012

  32. [32]

    Analysis of emotion recognition using facial expressions, speech and multimodal information,

    C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Proc. International Conference on Multimodal Interfaces (ICMI), State College, PA, 2004, pp. 205–211

  33. [33]

    Low-level fusion of audio and video feature for multi-modal emotion recognition,

    M. Wimmer, B. Schuller, D. Arsic, B. Radig, and G. Rigoll, “Low-level fusion of audio and video feature for multi-modal emotion recognition,” in Proc. International Conference on Computer Vision Theory and Applications (VISAPP), Funchal, Portugal, 2008, pp. 145–151

  34. [34]

    Decision level combination of multiple modalities for recognition and analysis of emotional expression,

    A. Metallinou, S. Lee, and S. Narayanan, “Decision level combination of multiple modalities for recognition and analysis of emotional expression,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, 2010, pp. 2462–2465

  35. [35]

    Audio-visual affect recognition through multi-stream fused HMM for HCI,

    Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. S. Huang, and S. Levinson, “Audio-visual affect recognition through multi-stream fused HMM for HCI,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, 2005, pp. 967–972

  36. [36]

    Prediction-based learning for continuous emotion recognition in speech,

    J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning for continuous emotion recognition in speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 5005–5009

  37. [37]

    Evaluating integration strategies for visuo-haptic object recognition,

    S. Toprak, N. Navarro-Guerrero, and S. Wermter, “Evaluating integration strategies for visuo-haptic object recognition,” Cognitive Computation, vol. 10, no. 3, pp. 408–425, June 2018

  38. [38]

    Deep metric learning using triplet network,

    E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proc. International Workshop on Similarity-Based Pattern Recognition (SIMBAD), Copenhagen, Denmark, 2015, pp. 84–92

  39. [39]

    Learning deep structured semantic models for web search using click-through data,

    P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using click-through data,” in Proc. International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, 2013, pp. 2333–2338

  40. [40]

    Trunk-branch ensemble convolutional neural networks for video-based face recognition,

    C. Ding and D. Tao, “Trunk-branch ensemble convolutional neural networks for video-based face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 1002–1014, Apr. 2018

  41. [41]

    AVEC 2016: Depression, mood, and emotion recognition workshop and challenge,

    M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proc. 6th International Workshop on Audio/Visual Emotion Challenge (AVEC), Amsterdam, The Netherlands, 2016, pp. 3–10

  42. [42]

    AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition,

    F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud, E. Çiftçi, H. Güleç, A. Salah, and M. Pantic, “AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition,” in Proc. 8th International Workshop on Audio/Visual Emotion Challenge (AVEC), Seoul, South K...

  43. [43]

    Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,

    F. Ringeval, A. Sonderegger, J. S. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in Proc. 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 2013, pp. 1–8

  44. [44]

    AV+EC 2015: The first affect recognition challenge bridging across audio, video, and physiological data,

    F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, “AV+EC 2015: The first affect recognition challenge bridging across audio, video, and physiological data,” in Proc. 5th International Workshop on Audio/Visual Emotion Challenge (AVEC), Brisbane, Australia, 2015, pp. 3–8

  45. [45]

    The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing,

    F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, and K. Truong, “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, Apr. 2016

  46. [46]

    openSMILE – The Munich versatile and fast open-source audio feature extractor,

    F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE – The Munich versatile and fast open-source audio feature extractor,” in Proc. ACM International Conference on Multimedia (MM), Florence, Italy, 2010, pp. 1459–1462

  47. [47]

    The OMG-emotion behavior dataset,

    P. Barros, N. Churamani, E. Lakomkin, H. Siqueira, A. Sutherland, and S. Wermter, “The OMG-emotion behavior dataset,” in Proc. International Joint Conference on Neural Networks (IJCNN), Rio, Brazil, 2018, pp. 1408–1412

  48. [48]

    Joint face detection and alignment using multitask cascaded convolutional networks,

    K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, Oct. 2016

  49. [49]

    Deep face recognition,

    O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in Proc. British Machine Vision Conference, Swansea, UK, 2015, pp. 1–12

  50. [50]

    On the properties of neural machine translation: Encoder-decoder approaches,

    K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proc. Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST), Doha, Qatar, 2014, pp. 103–111

  51. [51]

    An empirical exploration of recurrent network architectures,

    R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in Proc. International Conference on Machine Learning (ICML), Lille, France, 2015, pp. 2342–2350

  52. [52]

    Correcting time-continuous emotional labels by modeling the reaction lag of evaluators,

    S. Mariooryad and C. Busso, “Correcting time-continuous emotional labels by modeling the reaction lag of evaluators,” IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 97–108, Apr. 2015

  53. [53]

    Applied multiple regression/correlation analysis for the behavioral sciences,

    J. Cohen, P. Cohen, S. G. West, and L. S. Aiken, Applied multiple regression/correlation analysis for the behavioral sciences. Abingdon, UK: Routledge, 2013

  54. [54]

    End-to-end multimodal emotion recognition using deep neural networks,

    P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, Special Issue on End-to-End Speech and Language Processing, vol. 11, no. 8, pp. 1301–1309, Dec. 2017

  55. [55]

    Reconstruction-error-based learning for continuous emotion recognition in speech,

    J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Reconstruction-error-based learning for continuous emotion recognition in speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 2367–2371

  56. [56]

    Curriculum learning for speech emotion recognition from crowdsourced labels,

    R. Lotfian and C. Busso, “Curriculum learning for speech emotion recognition from crowdsourced labels,” arXiv preprint arXiv:1805.10339, May 2018.