EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings
Pith reviewed 2026-05-24 17:49 UTC · model grok-4.3
The pith
Crossmodal emotion embeddings strengthen monomodal recognition by transferring semantic information from auxiliary modalities during training only.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmoBed improves monomodal emotion recognition performance by training with joint multimodal learning on a shared network and crossmodal learning in a shared embedding space, allowing the system to use complementary information from auxiliary modalities without requiring their presence during inference.
What carries the argument
The EmoBed framework combines two training components: joint multimodal training through a shared recognition network, and crossmodal training that aligns modalities in a shared emotion embedding space.
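A minimal sketch of how the two components could combine into one training objective. Everything here is an illustrative assumption rather than the paper's method: the linear encoders, the squared-error task loss, the squared-distance alignment term, the weighting `alpha`, and all function names are hypothetical stand-ins chosen to keep the example small.

```python
# Illustrative sketch only: toy linear encoders and squared-error losses,
# not the networks or loss functions used in the paper.

def encode(features, weights):
    """Project one modality's features into the shared embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def shared_head(embedding, head_weights):
    """Single recognition head shared by every modality."""
    return sum(w * e for w, e in zip(head_weights, embedding))

def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def training_loss(audio, video, label, w_audio, w_video, w_head, alpha=0.5):
    """Combined loss for one paired training sample (alpha is hypothetical)."""
    z_a = encode(audio, w_audio)
    z_v = encode(video, w_video)
    # Joint multimodal term: both modalities pass through the same head.
    task = ((shared_head(z_a, w_head) - label) ** 2
            + (shared_head(z_v, w_head) - label) ** 2)
    # Crossmodal term: pull the paired embeddings together.
    align = squared_distance(z_a, z_v)
    return task + alpha * align

def predict_monomodal(audio, w_audio, w_head):
    """Inference uses the target modality only; no video input is required."""
    return shared_head(encode(audio, w_audio), w_head)
```

The point of the sketch is the asymmetry: `training_loss` consumes both modalities, while `predict_monomodal` touches only the audio encoder and the shared head.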
If this is right
- Monomodal systems trained this way outperform related single-modality baselines on both regression and classification tasks.
- The resulting monomodal performance reaches or exceeds levels reported for recent multimodal systems.
- The framework works for both dimensional emotion regression and categorical emotion classification.
Where Pith is reading between the lines
- Emotion representations may contain a modality-independent semantic core that can be extracted through shared training.
- The same training pattern could be tested on other paired modalities such as audio-visual or text-visual data.
- Deployment of emotion systems could rely on cheaper single-modality hardware while still benefiting from richer training data.
Load-bearing premise
Semantic emotion information learned from multiple modalities in a shared network or embedding space transfers to improve accuracy in a monomodal network when the auxiliary modalities are absent at test time.
What would settle it
Running the EmoBed training procedure on RECOLA or OMG-Emotion and measuring no gain or a loss in monomodal regression or classification accuracy compared with standard single-modality training.
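The comparison described above can be sketched for the regression case. The text here does not name the evaluation metric, so this example assumes the concordance correlation coefficient (CCC), a common choice for dimensional emotion regression; the function names and any inputs are hypothetical.

```python
# Sketch under an assumption: CCC as the regression metric.

def ccc(pred, gold):
    """Concordance correlation coefficient between predictions and labels."""
    n = len(gold)
    if n == 0 or len(pred) != n:
        raise ValueError("pred and gold must be non-empty and equally long")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # Unlike Pearson correlation, CCC also penalises a shift in the mean.
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)

def monomodal_gain(baseline_pred, emobed_pred, gold):
    """Positive when the crossmodally trained model beats the baseline."""
    return ccc(emobed_pred, gold) - ccc(baseline_pred, gold)
```

A gain at or below zero on held-out data, repeated across runs, would be the outcome that counts against the claim.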
Original abstract
Despite remarkable advances in emotion recognition, they are severely restrained from either the essentially limited property of the employed single modality, or the synchronous presence of all involved multiple modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowledge from other auxiliary modalities to improve the performance of an emotion recognition system at hand. The framework generally includes two main learning components, i.e., joint multimodal training and crossmodal training. Both of them tend to explore the underlying semantic emotion information but with a shared recognition network or with a shared emotion embedding space, respectively. In doing this, the enhanced system trained with this approach can efficiently make use of the complementary information from other modalities. Nevertheless, the presence of these auxiliary modalities is not demanded during inference. To empirically investigate the effectiveness and robustness of the proposed framework, we perform extensive experiments on the two benchmark databases RECOLA and OMG-Emotion for the tasks of dimensional emotion regression and categorical emotion classification, respectively. The obtained results show that the proposed framework significantly outperforms related baselines in monomodal inference, and are also competitive or superior to the recently reported systems, which emphasises the importance of the proposed crossmodal learning for emotion recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EmoBed, a crossmodal emotion embedding framework with two components—joint multimodal training (sharing a recognition network) and crossmodal training (aligning to a shared emotion embedding space)—to transfer semantic emotion information from auxiliary modalities into a monomodal inference network. Experiments on RECOLA (dimensional regression) and OMG-Emotion (categorical classification) report that the resulting monomodal system outperforms standard monomodal baselines and is competitive with recent multimodal systems, while requiring only a single modality at test time.
Significance. If the reported gains hold under rigorous statistical evaluation, the work is significant for practical emotion recognition because it demonstrates a training-time mechanism to inject complementary multimodal information into single-modality models without incurring auxiliary-modality costs at inference. This addresses a common deployment constraint and could generalize to other multimodal perception tasks.
minor comments (2)
- Abstract, final sentence: subject-verb agreement error ('the proposed framework ... are also competitive').
- The abstract asserts 'significant outperformance' and 'competitive or superior' results but supplies no numerical deltas, error bars, or statistical tests; the full manuscript should make these explicit in the experimental section for reproducibility.
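One way to produce the error bars requested in the second comment is a paired bootstrap over per-recording scores. This is a generic sketch, not a procedure taken from the paper: the function name, the resample count, and the idea of scoring each recording separately are all assumptions.

```python
# Sketch only: percentile bootstrap on paired per-recording scores.
import random

def paired_bootstrap_ci(baseline, improved, n_resamples=2000, level=0.95, seed=0):
    """Confidence interval for the mean paired difference (improved - baseline)."""
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [b - a for a, b in zip(baseline, improved)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(deltas)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * (n_resamples - 1))]
    hi = means[int((1.0 - tail) * (n_resamples - 1))]
    return lo, hi
```

An interval that excludes zero would support the abstract's claim of a significant gain; one that straddles zero would not.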
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We are pleased that the significance of the EmoBed framework for practical single-modality emotion recognition is acknowledged.
Circularity Check
No significant circularity identified
The paper describes a training framework (joint multimodal training plus crossmodal embedding alignment) whose auxiliary modalities are used exclusively at training time; the monomodal inference network is evaluated under an explicit single-modality test constraint on RECOLA and OMG-Emotion. No equations, fitted parameters, or self-citations are shown that would reduce the reported performance gains to a definitional identity or to a renamed input. The central claim therefore rests on an externally verifiable empirical comparison rather than on any of the enumerated circular patterns.