EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings

Björn Schuller; Jing Han; Zhao Ren; Zixing Zhang

arXiv: 1907.10428 · v1 · pith:7OYYHJVPnew · submitted 2019-07-23 · 💻 cs.LG · cs.HC · cs.SD · eess.AS

Pith reviewed 2026-05-24 17:49 UTC · model grok-4.3

keywords: emotion recognition · crossmodal embeddings · multimodal training · monomodal inference · dimensional regression · categorical classification · shared embedding space

The pith

Crossmodal emotion embeddings strengthen monomodal recognition by transferring semantic information from auxiliary modalities during training only.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EmoBed to improve emotion recognition from a single modality by drawing on knowledge from other modalities during training. It combines joint multimodal training that shares a recognition network with crossmodal training that shares an emotion embedding space. Both steps extract underlying semantic emotion information so the final model can exploit complementary cues without needing the extra modalities at inference time. Experiments on RECOLA for dimensional regression and OMG-Emotion for categorical classification show gains over monomodal baselines and results competitive with recent systems.

Core claim

EmoBed improves monomodal emotion recognition performance by training with joint multimodal learning on a shared network and crossmodal learning in a shared embedding space, allowing the system to use complementary information from auxiliary modalities without requiring their presence during inference.

What carries the argument

EmoBed framework consisting of joint multimodal training with a shared recognition network and crossmodal training with a shared emotion embedding space.
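
As a rough illustration of the structure named above, the sketch below routes modality-specific encoders into one shared embedding space and one shared recognition head, then predicts from a single modality. Everything in it is an assumption made for illustration: the class name `SharedHeadModel`, the plain linear layers, and the toy weights are hypothetical and are not taken from the paper.

```python
def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class SharedHeadModel:
    """Hypothetical layout: one encoder per modality, one head shared by all."""

    def __init__(self, encoders, head):
        self.encoders = encoders  # modality name -> matrix mapping into the shared space
        self.head = head          # recognition layer reused by every modality

    def embed(self, modality, features):
        # Each modality has its own input size but lands in the same embedding space.
        return linear(self.encoders[modality], features)

    def predict(self, modality, features):
        # Only the requested modality is needed; the other encoders stay unused.
        return linear(self.head, self.embed(modality, features))


model = SharedHeadModel(
    encoders={
        "audio": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # 3-dim input -> 2-dim embedding
        "video": [[0.5, 0.5], [1.0, -1.0]],            # 2-dim input -> 2-dim embedding
    },
    head=[[1.0, 1.0]],                                 # 2-dim embedding -> 1 output
)

# Monomodal inference: audio features only, no video input required.
audio_pred = model.predict("audio", [0.2, 0.3, 0.9])
```

The point of the layout is that the shared head sees embeddings from every modality during training, while `predict` at test time touches only one encoder.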

If this is right

Monomodal systems trained this way outperform related single-modality baselines on both regression and classification tasks.
The resulting monomodal performance reaches or exceeds levels reported for recent multimodal systems.
The framework works for both dimensional emotion regression and categorical emotion classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Emotion representations may contain a modality-independent semantic core that can be extracted through shared training.
The same training pattern could be tested on other paired modalities such as audio-visual or text-visual data.
Deployment of emotion systems could rely on cheaper single-modality hardware while still benefiting from richer training data.

Load-bearing premise

Semantic emotion information learned from multiple modalities in a shared network or embedding space transfers to improve accuracy in a monomodal network when the auxiliary modalities are absent at test time.

What would settle it

Running the EmoBed training procedure on RECOLA or OMG-Emotion and measuring no gain or a loss in monomodal regression or classification accuracy compared with standard single-modality training.
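
A minimal sketch of that check for the regression case follows. It assumes the concordance correlation coefficient (CCC) as the score, which is common in RECOLA-based work but is not stated on this page; the function names and the toy numbers are hypothetical.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(gold)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


def monomodal_gain(baseline_pred, crossmodal_pred, gold):
    """Score difference; a value <= 0 would count against the premise."""
    return ccc(crossmodal_pred, gold) - ccc(baseline_pred, gold)


# Toy data (made up): the baseline is offset by 1, the other system matches the labels.
gold = [0.0, 1.0, 2.0, 3.0]
baseline = [1.0, 2.0, 3.0, 4.0]
crossmodal = [0.0, 1.0, 2.0, 3.0]
gain = monomodal_gain(baseline, crossmodal, gold)
```

Both systems would be scored on the same single-modality test inputs, so the only difference between them is how they were trained.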

Figures

Figures reproduced from arXiv: 1907.10428 by Björn Schuller, Jing Han, Zhao Ren, Zixing Zhang.

**Figure 1.** The proposed crossmodal Emotion emBedding (EmoBed) framework for monomodal emotion recognition.

**Figure 2.** Structure comparison among the proposed joint audiovisual training (e), and other related multimodal learning frameworks (i.e., early fusion …

**Figure 4.** Visualisation of the learnt representations of the development set of the OMG-Emotion database when using the proposed EmoBed systems or the classic monomodal systems. Red and green markers: representations from audio and video modalities; circle and cross markers: neutral and sad categories.

**Figure 3.** Visualisation of the learnt representations of the development set of the RECOLA database when using the proposed EmoBed systems or the classic monomodal systems. Red, green, and yellow markers: representations from audio (eGeMAPS), video (appearance), and video (geometric) modalities; circle and cross markers: high and low arousal/valence.

**Figure 7.** Impact of the joint auxiliary modality loss on the joint audiovisual training systems (a), and impact of the crossmodal triplet loss on the crossmodal triplet training systems (b), with the OMG-Emotion database. The best performed α or β is indicated in each case.

**Figure 6.** Impact of the joint auxiliary modality loss on the joint audiovisual training systems (a), and impact of the crossmodal triplet loss on the crossmodal triplet training systems (b), with the RECOLA database for valence regression. The best performed α or β is indicated in each case.
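
The captions for Figures 6 and 7 name two weighted terms: a joint auxiliary modality loss with weight α and a crossmodal triplet loss with weight β. The sketch below shows one plausible way such terms could enter a training objective. It is an assumption throughout: the hinge form on squared Euclidean distances, the `margin` value, the additive combination, and all function names are illustrative and may differ from what the paper uses.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, margin=0.2):
    """Hinge triplet loss averaged over a batch.

    anchors: embeddings from the target modality.
    positives: other-modality embeddings that share the anchor's emotion.
    negatives: other-modality embeddings with a different emotion.
    """
    losses = [
        max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)


def weighted_objective(main_loss, aux_loss, trip_loss, alpha, beta):
    """Assumed additive form: target loss + alpha * auxiliary loss + beta * triplet loss."""
    return main_loss + alpha * aux_loss + beta * trip_loss


# Toy batch (made up): audio anchors with matching and non-matching video embeddings.
audio = [[0.0, 0.0], [1.0, 1.0]]
video_same = [[0.1, 0.0], [1.0, 0.9]]
video_diff = [[1.0, 1.0], [0.0, 0.0]]

trip = triplet_loss(audio, video_same, video_diff)
total = weighted_objective(main_loss=0.30, aux_loss=0.40, trip_loss=trip,
                           alpha=0.5, beta=0.5)
```

In this toy batch the matching pairs are already much closer than the non-matching ones, so the triplet term contributes nothing; swapping `video_same` and `video_diff` makes it positive.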

Original abstract

Despite remarkable advances in emotion recognition, they are severely restrained from either the essentially limited property of the employed single modality, or the synchronous presence of all involved multiple modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowledge from other auxiliary modalities to improve the performance of an emotion recognition system at hand. The framework generally includes two main learning components, i. e., joint multimodal training and crossmodal training. Both of them tend to explore the underlying semantic emotion information but with a shared recognition network or with a shared emotion embedding space, respectively. In doing this, the enhanced system trained with this approach can efficiently make use of the complementary information from other modalities. Nevertheless, the presence of these auxiliary modalities is not demanded during inference. To empirically investigate the effectiveness and robustness of the proposed framework, we perform extensive experiments on the two benchmark databases RECOLA and OMG-Emotion for the tasks of dimensional emotion regression and categorical emotion classification, respectively. The obtained results show that the proposed framework significantly outperforms related baselines in monomodal inference, and are also competitive or superior to the recently reported systems, which emphasises the importance of the proposed crossmodal learning for emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EmoBed shows a workable way to lift monomodal emotion recognition by training with auxiliary modalities that drop out at test time, and the reported gains on RECOLA and OMG-Emotion look consistent enough to review.

The key point is that this paper gives a concrete way to improve monomodal emotion recognition by using crossmodal information only during training. The EmoBed approach combines joint training across modalities with alignment in a shared embedding space, then runs inference on one modality alone. Experiments on RECOLA for dimensional regression and OMG-Emotion for categorical classification show it beats related baselines and holds up against recently reported systems.

What the paper does well is keep the inference constraint strict (no auxiliary modalities at test time) and still report gains. That matches real-world needs where you might collect multimodal data for training but deploy on single sensors. Covering two tasks gives a broader view than one would.

The soft spots are in the evaluation details. The abstract gives no error bars or statistical tests, so it is hard to gauge whether the outperformance is robust. The full text likely includes them, but if the gains are small or vary by run, that would weaken the case. The broader crossmodal idea is not brand new, so the contribution sits in the specific application and combination.

Readers working on practical emotion recognition systems in affective computing would get the most from this. It offers a training recipe they can try without changing their inference setup. Someone outside that area might not need it.

The work shows clear thinking on the transfer problem and engages with the literature through the baselines. It deserves a serious referee because the central mechanism is consistent and the results are on public benchmarks. I recommend sending it to peer review. Ask for more on component contributions and any variance in the numbers.

Referee Report

0 major / 2 minor

Summary. The paper proposes EmoBed, a crossmodal emotion embedding framework with two components—joint multimodal training (sharing a recognition network) and crossmodal training (aligning to a shared emotion embedding space)—to transfer semantic emotion information from auxiliary modalities into a monomodal inference network. Experiments on RECOLA (dimensional regression) and OMG-Emotion (categorical classification) report that the resulting monomodal system outperforms standard monomodal baselines and is competitive with recent multimodal systems, while requiring only a single modality at test time.

Significance. If the reported gains hold under rigorous statistical evaluation, the work is significant for practical emotion recognition because it demonstrates a training-time mechanism to inject complementary multimodal information into single-modality models without incurring auxiliary-modality costs at inference. This addresses a common deployment constraint and could generalize to other multimodal perception tasks.

minor comments (2)

Abstract, final sentence: subject-verb agreement error ('the proposed framework ... are also competitive').
The abstract asserts 'significant outperformance' and 'competitive or superior' results but supplies no numerical deltas, error bars, or statistical tests; the full manuscript should make these explicit in the experimental section for reproducibility.
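
One way the requested statistical evidence could be produced is a paired bootstrap over per-item scores from the baseline and the crossmodally trained system. The sketch below is only an illustration of that idea: the function name, the resample count, and the toy scores are hypothetical, and nothing here reflects the tests the authors actually ran.

```python
import random


def paired_bootstrap_gain(baseline_scores, system_scores, n_resamples=2000, seed=0):
    """Return (observed mean gain, share of resamples whose mean gain is <= 0).

    Scores are per-item and paired: index i in both lists refers to the same test item.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    n = len(diffs)
    observed = sum(diffs) / n
    non_positive = 0
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            non_positive += 1
    return observed, non_positive / n_resamples


# Toy per-item scores (made up for illustration).
baseline_scores = [0.5, 0.6, 0.4]
system_scores = [0.6, 0.7, 0.5]
mean_gain, share_non_positive = paired_bootstrap_gain(baseline_scores, system_scores)
```

A small share of non-positive resamples suggests the gain is not an artefact of which test items happened to be sampled; reporting it next to the numerical delta would address the comment above.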

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We are pleased that the significance of the EmoBed framework for practical single-modality emotion recognition is acknowledged.

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes a training framework (joint multimodal training plus crossmodal embedding alignment) whose auxiliary modalities are used exclusively at training time; the monomodal inference network is evaluated under an explicit single-modality test constraint on RECOLA and OMG-Emotion. No equations, fitted parameters, or self-citations are shown that would reduce the reported performance gains to a definitional identity or to a renamed input. The central claim therefore rests on an externally verifiable empirical comparison rather than on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5767 in / 1043 out tokens · 39177 ms · 2026-05-24T17:49:36.136825+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, Jan. 2001.

[2] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, Jan. 2009.

[3] Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Distributing recognition in computational paralinguistics,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 406–417, Oct. 2014.

[4] E. Cambria, “Affective computing and sentiment analysis,” IEEE Intelligent Systems, vol. 31, no. 2, pp. 102–107, Mar. 2016.

[5] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, Mar. 2011.

[6] S. L. Happy and A. Routray, “Automatic facial expression recognition using features of salient facial patches,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 1–12, Jan. 2015.

[7] C. Clavel and Z. Callejas, “Sentiment analysis: From opinion mining to human-agent interaction,” IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 74–93, Jan. 2016.

[8] S. M. Alarcao and M. J. Fonseca, “Emotions recognition using EEG signals: A survey,” IEEE Transactions on Affective Computing, June 2018, 20 pages.

[9] B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition with multi-task learning: Seeking common ground while preserving differences,” IEEE Transactions on Affective Computing, Mar. 2017, 14 pages.

[10] J. Han, Z. Zhang, N. Cummins, and B. Schuller, “Adversarial training in affective computing and sentiment analysis: Recent advances and perspectives,” IEEE Computational Intelligence Magazine, 2018, 13 pages.

[11] M. Abdelwahab and C. Busso, “Domain adversarial for acoustic emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423–2435, Dec. 2018.

[12] M. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing, vol. 2, no. 2, pp. 92–105, Apr. 2011.

[13] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, “Fusing audio, visual and textual clues for sentiment analysis from multimodal content,” Neurocomputing, vol. 174, pp. 50–59, Jan. 2016.

[14] J. Han, Z. Zhang, N. Cummins, F. Ringeval, and B. Schuller, “Strength modelling for real-world automatic continuous affect recognition from audiovisual signals,” Image and Vision Computing, vol. 65, pp. 76–86, Sep. 2017.

[15] Z. Zhang, J. Han, and B. Schuller, “Dynamic difficulty awareness training for continuous emotion prediction,” IEEE Transactions on Multimedia, vol. PP, Sep. 2018, 13 pages.

[16] S. Alghowinem, R. Goecke, M. Wagner, J. Epps, M. Hyett, G. Parker, and M. Breakspear, “Multimodal depression detection: Fusion analysis of paralinguistic, head pose and eye gaze behaviors,” IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 478–490, Oct. 2018.

[17] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” Journal of Machine Learning Research, vol. 11, no. Mar., pp. 1109–1135, 2010.

[18] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 815–823.

[19] F. Eyben, M. Wöllmer, and B. Schuller, “A multitask approach to continuous five-dimensional affect sensing in natural speech,” ACM Transactions on Interactive Intelligent Systems, vol. 2, no. 1, pp. 1–29, Mar. 2012.

[20] R. Xia and Y. Liu, “A multi-task learning framework for emotion recognition using 2D continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, Jan. 2017.

[21] J. Han, Z. Zhang, M. Schmitt, and B. Schuller, “From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty,” in Proc. ACM International Conference on Multimedia (MM), Mountain View, CA, 2017, pp. 890–897.

[22] J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural dimensional continues emotion recognition in dyadic interactions,” in Proc. 8th International Workshop on Audio/Visual Emotion Challenge (AVEC), Seoul, South Korea, 2018, pp. 65–72.

[23] S. A. Taylor, N. Jaques, E. Nosakhare, A. Sano, and R. Picard, “Personalized multitask learning for predicting tomorrow’s mood, stress, and health,” IEEE Transactions on Affective Computing, Dec. 2018, 14 pages.

[24] J. Han, Z. Zhang, G. Keren, and B. Schuller, “Emotion recognition in speech with latent discriminative representations learning,” Acta Acustica united with Acustica, vol. 104, no. 5, pp. 737–740, Sep. 2018.

[25] J. Huang, Y. Li, J. Tao, and Z. Lian, “Speech emotion recognition from variable-length inputs with triplet loss function,” in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, 2018, pp. 3673–3677.

[26] X. Yang, T. Zhang, and C. Xu, “Cross-domain feature learning in multimedia,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 64–78, Jan. 2015.

[27] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, Mar. 2015.

[28] Y. Aytar, C. Vondrick, and A. Torralba, “SoundNet: Learning sound representations from unlabeled video,” in Proc. Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016, pp. 892–900.

[29] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proc. ACM International Conference on Multimedia (MM), Mountain View, CA, 2017, pp. 154–162.

[30] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proc. ACM International Conference on Multimedia (MM), Seoul, Korea, 2018, pp. 292–301.

[31] A. Metallinou, M. Wöllmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan, “Context-sensitive learning for enhanced audiovisual emotion classification,” IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 184–198, Jan. 2012.

[32] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Proc. International Conference on Multimodal Interfaces (ICMI), State College, PA, 2004, pp. 205–211.

[33] M. Wimmer, B. Schuller, D. Arsic, B. Radig, and G. Rigoll, “Low-level fusion of audio and video feature for multi-modal emotion recognition,” in Proc. International Conference on Computer Vision Theory and Applications (VISAPP), Funchal, Portugal, 2008, pp. 145–151.

[34] A. Metallinou, S. Lee, and S. Narayanan, “Decision level combination of multiple modalities for recognition and analysis of emotional expression,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, 2010, pp. 2462–2465.

[35] Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. S. Huang, and S. Levinson, “Audio-visual affect recognition through multi-stream fused HMM for HCI,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, 2005, pp. 967–972.

[36] J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning for continuous emotion recognition in speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 5005–5009.

[37] S. Toprak, N. Navarro-Guerrero, and S. Wermter, “Evaluating integration strategies for visuo-haptic object recognition,” Cognitive Computation, vol. 10, no. 3, pp. 408–425, June 2018.

[38] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proc. International Workshop on Similarity-Based Pattern Recognition (SIMBAD), Copenhagen, Denmark, 2015, pp. 84–92.

[39] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using click-through data,” in Proc. International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, 2013, pp. 2333–2338.

[40] C. Ding and D. Tao, “Trunk-branch ensemble convolutional neural networks for video-based face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 1002–1014, Apr. 2018.

[41] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proc. 6th International Workshop on Audio/Visual Emotion Challenge (AVEC), Amsterdam, The Netherlands, 2016, pp. 3–10.

[42] F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud, E. Çiftçi, H. Güleç, A. Salah, and M. Pantic, “AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition,” in Proc. 8th International Workshop on Audio/Visual Emotion Challenge (AVEC), Seoul, South K...

[43] F. Ringeval, A. Sonderegger, J. S. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in Proc. 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 2013, pp. 1–8.

[44] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, “AV+EC 2015: The first affect recognition challenge bridging across audio, video, and physiological data,” in Proc. 5th International Workshop on Audio/Visual Emotion Challenge (AVEC), Brisbane, Australia, 2015, pp. 3–8.

[45] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, and K. Truong, “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, Apr. 2016.

[46] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE – The Munich versatile and fast open-source audio feature extractor,” in Proc. ACM International Conference on Multimedia (MM), Florence, Italy, 2010, pp. 1459–1462.

[47] P. Barros, N. Churamani, E. Lakomkin, H. Siqueira, A. Sutherland, and S. Wermter, “The OMG-emotion behavior dataset,” in Proc. International Joint Conference on Neural Networks (IJCNN), Rio, Brazil, 2018, pp. 1408–1412.

[48] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, Oct. 2016.

[49] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in Proc. British Machine Vision Conference, Swansea, UK, 2015, pp. 1–12.

[50] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proc. Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST), Doha, Qatar, 2014, pp. 103–111.

[51] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in Proc. International Conference on Machine Learning (ICML), Lille, France, 2015, pp. 2342–2350.

[52] S. Mariooryad and C. Busso, “Correcting time-continuous emotional labels by modeling the reaction lag of evaluators,” IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 97–108, Apr. 2015.

[53] J. Cohen, P. Cohen, S. G. West, and L. S. Aiken, Applied multiple regression/correlation analysis for the behavioral sciences. Abingdon, UK: Routledge, 2013.

[54] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, Special Issue on End-to-End Speech and Language Processing, vol. 11, no. 8, pp. 1301–1309, Dec. 2017.

[55] J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Reconstruction-error-based learning for continuous emotion recognition in speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 2367–2371.

[56] R. Lotfian and C. Busso, “Curriculum learning for speech emotion recognition from crowdsourced labels,” arXiv preprint arXiv:1805.10339, May 2018.


work page 2018

[12] [12]

Continuous prediction of spontaneous affect from multiple cues and modalities in valence- arousal space,

M. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence- arousal space,” IEEE Transactions on Affective Computing , vol. 2, no. 2, pp. 92–105, Apr. 2011

work page 2011

[13] [13]

Fusing audio, visual and textual clues for sentiment analysis from multimodal content,

S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, “Fusing audio, visual and textual clues for sentiment analysis from multimodal content,” Neurocomputing, vol. 174, pp. 50–59, Jan. 2016

work page 2016

[14] [14]

Strength modelling for real-world automatic continuous affect recognition from audiovisual signals,

J. Han, Z. Zhang, N. Cummins, F. Ringeval, and B. Schuller, “Strength modelling for real-world automatic continuous affect recognition from audiovisual signals,” Image and Vision Computing, vol. 65, pp. 76–86, Sep. 2017

work page 2017

[15] [15]

Dynamic difﬁculty awareness training for continuous emotion prediction,

Z. Zhang, J. Han, and B. Schuller, “Dynamic difﬁculty awareness training for continuous emotion prediction,” IEEE Transactions on Multimedia, vol. PP , Sep. 2018, 13 pages

work page 2018

[16] [16]

Multimodal depression detection: Fusion analysis of paralinguistic, head pose and eye gaze behav- iors,

S. Alghowinem, R. Goecke, M. Wagner, J. Epps, M. Hyett, G. Parker, and M. Breakspear, “Multimodal depression detection: Fusion analysis of paralinguistic, head pose and eye gaze behav- iors,” IEEE Transactions on Affective Computing , vol. 9, no. 4, pp. 478–490, Oct. 2018

work page 2018

[17] [17]

Large scale online learning of image similarity through ranking,

G. Chechik, V . Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” Journal of Machine Learning Research, vol. 11, no. Mar., pp. 1109–1135, 2010

work page 2010

[18] [18]

FaceNet: A uniﬁed embedding for face recognition and clustering,

F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A uniﬁed embedding for face recognition and clustering,” in Proc. IEEE con- ference on Computer Vision and Pattern Recognition (CVPR) , Boston, MA, 2015, pp. 815–823

work page 2015

[19] [19]

A multitask approach to continuous ﬁve-dimensional affect sensing in natural speech,

F. Eyben, M. W ¨ollmer, and B. Schuller, “A multitask approach to continuous ﬁve-dimensional affect sensing in natural speech,” ACM Transactions on Interactive Intelligent Systems, vol. 2, no. 1, pp. 1–29, Mar. 2012

work page 2012

[20] [20]

A multi-task learning framework for emotion recognition using 2D continuous space,

R. Xia and Y. Liu, “A multi-task learning framework for emotion recognition using 2D continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, Jan. 2017

work page 2017

[21] [21]

From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty,

J. Han, Z. Zhang, M. Schmitt, and B. Schuller, “From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty,” in Proc. ACM International Conference on Multimedia (MM), Mountain View, CA, 2017, pp. 890–897

work page 2017

[22] [22]

Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,

J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,” in Proc. 8th International Workshop on Audio/Visual Emotion Challenge (AVEC), Seoul, South Korea, 2018, pp. 65–72

work page 2018

[23] [23]

Personalized multitask learning for predicting tomorrow’s mood, stress, and health,

S. A. Taylor, N. Jaques, E. Nosakhare, A. Sano, and R. Picard, “Personalized multitask learning for predicting tomorrow’s mood, stress, and health,” IEEE Transactions on Affective Computing , Dec. 2018, 14 pages

work page 2018

[24] [24]

Emotion recognition in speech with latent discriminative representations learning,

J. Han, Z. Zhang, G. Keren, and B. Schuller, “Emotion recognition in speech with latent discriminative representations learning,” Acta Acustica united with Acustica, vol. 104, no. 5, pp. 737–740, Sep. 2018

work page 2018

[25] [25]

Speech emotion recognition from variable-length inputs with triplet loss function,

J. Huang, Y. Li, J. Tao, and Z. Lian, “Speech emotion recognition from variable-length inputs with triplet loss function,” inProc. An- nual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, 2018, pp. 3673–3677

work page 2018

[26] [26]

Cross-domain feature learning in multimedia,

X. Yang, T. Zhang, and C. Xu, “Cross-domain feature learning in multimedia,” IEEE Transactions on Multimedia , vol. 17, no. 1, pp. 64–78, Jan. 2015

work page 2015

[27] [27]

Learning consistent feature representation for cross-modal multimedia retrieval,

C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,”IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, Mar. 2015

work page 2015

[28] [28]

SoundNet: Learning sound representations from unlabeled video,

Y. Aytar, C. Vondrick, and A. Torralba, “SoundNet: Learning sound representations from unlabeled video,” in Proc. Advances in Neural Information Processing Systems (NIPS) , Barcelona, Spain, 2016, pp. 892–900

work page 2016

[29] [29]

Adversarial cross-modal retrieval,

B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proc. ACM International Converence on Multimedia (MM), Mountain View, CA, 2017, pp. 154–162

work page 2017

[30] [30]

Emotion recognition in speech using cross-modal transfer in the wild,

S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proc. ACM International Conference on Multimedia (MM) , Seoul, Korea, 2018, pp. 292–301

work page 2018

[31] [31]

Context-sensitive learning for enhanced au- diovisual emotion classiﬁcation,

A. Metallinou, M. W ¨ollmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan, “Context-sensitive learning for enhanced au- diovisual emotion classiﬁcation,” IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 184–198, Jan. 2012

work page 2012

[32] [32]

Anal- ysis of emotion recognition using facial expressions, speech and multimodal information,

C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Anal- ysis of emotion recognition using facial expressions, speech and multimodal information,” in Proc. International Conference on Mul- timodal Interfaces (ICMI), State College, PA, 2004, pp. 205–211

work page 2004

[33] [33]

Low- level fusion of audio and video feature for multi-modal emotion recognition,

M. Wimmer, B. Schuller, D. Arsic, B. Radig, and G. Rigoll, “Low- level fusion of audio and video feature for multi-modal emotion recognition,” in Proc. International Conference on Computer Vision Theory and Applications (VISAPP), Funchal, Portugal, 2008, pp. 145– 151

work page 2008

[34] [34]

Decision level com- bination of multiple modalities for recognition and analysis of emotional expression,

A. Metallinou, S. Lee, and S. Narayanan, “Decision level com- bination of multiple modalities for recognition and analysis of emotional expression,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Dallas, TX, 2010, pp. 2462–2465

work page 2010

[35] [35]

Audio-visual affect recognition through multi- stream fused HMM for HCI,

Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. S. Huang, and S. Levinson, “Audio-visual affect recognition through multi- stream fused HMM for HCI,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , San Diego, CA, 2005, pp. 967–972

work page 2005

[36] [36]

Prediction-based learning for continuous emotion recognition in speech,

J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning for continuous emotion recognition in speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), New Orleans, LA, 2017, pp. 5005–5009

work page 2017

[37] [37]

Evaluating in- tegration strategies for visuo-haptic object recognition,

S. Toprak, N. Navarro-Guerrero, and S. Wermter, “Evaluating in- tegration strategies for visuo-haptic object recognition,” Cognitive computation, vol. 10, no. 3, pp. 408–425, June 2018

work page 2018

[38] [38]

Deep metric learning using triplet net- work,

E. Hoffer and N. Ailon, “Deep metric learning using triplet net- work,” in Prof. International Workshop on Similarity-Based Pattern Recognition (SIMBAD), Copenhagen, Denmark, 2015, pp. 84–92

work page 2015

[39] [39]

Learning deep structured semantic models for web search using click- through data,

P . Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using click- through data,” in Proc. International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, 2013, pp. 2333– 2338

work page 2013

[40] [40]

Trunk-branch ensemble convolutional neural networks for video-based face recognition,

C. Ding and D. Tao, “Trunk-branch ensemble convolutional neural networks for video-based face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 40, no. 4, pp. 1002– 1014, Apr. 2018

work page 2018

[41] [41]

AVEC 2016: Depression, mood, and emotion recognition workshop and challenge,

M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor- res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proc. 6th International Workshop on Audio/Visual Emo- tion Challenge (AVEC), Amsterdam, The Netherlands, 2016, pp. 3– 10

work page 2016

[42] [42]

AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition,

F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud, E. Ciftc ¸i, H. G ¨ulec ¸, A. Salah, and M. Pantic, “AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition,” in Proc. 8th International Workshop on Audio/Visual Emotion Challenge (AVEC), Seoul, South K...

work page 2018

[43] [43]

Introduc- ing the RECOLA multimodal corpus of remote collaborative and affective interactions,

F. Ringeval, A. Sonderegger, J. S. Sauer, and D. Lalanne, “Introduc- ing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in Proc. 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) , Shanghai, China, 2013, pp. 1–8

work page 2013

[44] [44]

AV+EC 2015: The ﬁrst affect recognition challenge bridging across audio, video, and physio- logical data,

F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, “AV+EC 2015: The ﬁrst affect recognition challenge bridging across audio, video, and physio- logical data,” in Proc. 5th International Workshop on Audio/Visual Emotion Challenge (AVEC), Brisbane, Australia, 2015, pp. 3–8. 12

work page 2015

[45] [45]

The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing,

F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Devillers, J. Epps, P . Laukka, S. Narayanan, and K. Truong, “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, Apr. 2016

work page 2016

[46] [46]

openSMILE – The Munich versatile and fast open-source audio feature extractor,

F. Eyben, M. W ¨ollmer, and B. Schuller, “openSMILE – The Munich versatile and fast open-source audio feature extractor,” in Proc. ACM International Conference on Multimedia (MM) , Florence, Italy, 2010, pp. 1459–1462

work page 2010

[47] [47]

The OMG-emotion behavior dataset,

P . Barros, N. Churamani, E. Lakomkin, H. Siqueira, A. Sutherland, and S. Wermter, “The OMG-emotion behavior dataset,” inProc. In- ternational Joint Conference on Neural Networks (IJCNN) , Rio, Brazil, 2018, pp. 1408–1412

work page 2018

[48] [48]

Joint face detection and alignment using multitask cascaded convolutional networks,

K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters , vol. 23, no. 10, pp. 1499–1503, Oct. 2016

work page 2016

[49] [49]

Deep face recogni- tion,

O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recogni- tion,” in Proc. British Machine Vision Conference, Swansea, UK, 2015, pp. 1–12

work page 2015

[50] [50]

On the properties of neural machine translation: Encoder-decoder approaches,

K. Cho, B. Van Merri ¨enboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proc. Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST), Doha, Qatar, 2014, pp. 103–111

work page 2014

[51] [51]

An empirical explo- ration of recurrent network architectures,

R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical explo- ration of recurrent network architectures,” in Proc. International Conference on Machine Learning (ICML) , Lille, France, 2015, pp. 2342–2350

work page 2015

[52] [52]

Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,

S. Mariooryad and C. Busso, “Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,” IEEE Transactions on Affective Computing , vol. 6, no. 2, pp. 97–108, Apr. 2015

work page 2015

[53] [53]

Cohen, P

J. Cohen, P . Cohen, S. G. West, and L. S. Aiken, Applied multiple regression/correlation analysis for the behavioral sciences . Abingdon, UK: Routledge, 2013

work page 2013
