Multimodal and Multi-view Models for Emotion Recognition

Chao Wang; Gustavo Aguilar; Viktor Rozgi\'c; Weiran Wang

arxiv: 1906.10198 · v1 · pith:DUHYGSBFnew · submitted 2019-06-24 · 💻 cs.CL · cs.SD· eess.AS

Multimodal and Multi-view Models for Emotion Recognition

Gustavo Aguilar , Viktor Rozgi\'c , Weiran Wang , Chao Wang This is my paper

Pith reviewed 2026-05-25 17:12 UTC · model grok-4.3

classification 💻 cs.CL cs.SDeess.AS

keywords emotion recognitionmultimodal learningmulti-view learningcontrastive lossacoustic featureslexical featuresIEMOCAPattention mechanisms

0 comments

The pith

Multimodal training on lexical and acoustic features lets an acoustic-only emotion model outperform purely acoustic baselines via contrastive loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a practical way to combine lexical and acoustic information during training for emotion recognition while producing a model that needs only acoustic input at deployment time. It first tests multimodal models with attention mechanisms to measure how much lexical data helps. It then casts the problem as multi-view learning and applies a contrastive loss so that semantic information from the multimodal teacher transfers to an acoustic student network. If this holds, the resulting acoustic model beats models trained on acoustic features alone and the full multimodal version beats prior reported results on the USC-IEMOCAP dataset. Readers would care because real systems often cannot afford lexical processing at inference due to cost or privacy limits.

Core claim

Our multimodal model outperforms the previous state of the art on the USC-IEMOCAP dataset reported on lexical and acoustic information. Additionally, our multi-view-trained acoustic network significantly surpasses models that have been exclusively trained with acoustic features. The task is framed as a multi-view learning problem that induces semantic information from a multimodal model into an acoustic-only network using a contrastive loss function.

What carries the argument

The contrastive loss in the multi-view learning setup that transfers semantic information from the multimodal teacher to the acoustic student network.

If this is right

The multimodal model with lexical and acoustic inputs exceeds previously reported state-of-the-art accuracy on IEMOCAP.
The acoustic student trained under the multi-view contrastive objective exceeds the accuracy of acoustic models trained in isolation.
Attention mechanisms quantify the contribution of lexical information when both modalities are present during training.
The final acoustic model can be deployed without requiring lexical inputs such as ASR output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contrastive transfer could be applied to other modality pairs where one input is unavailable or costly at inference.
The approach could reduce dependence on automatic speech recognition pipelines in deployed emotion recognition systems.
Repeating the multi-view training on additional emotion datasets would test whether the semantic transfer holds beyond the IEMOCAP splits.

Load-bearing premise

The contrastive loss successfully induces useful semantic information from the multimodal teacher into the acoustic student without lexical inputs at test time.

What would settle it

A direct comparison on a held-out IEMOCAP test split in which the multi-view acoustic model shows no accuracy gain over an acoustic-only baseline trained without the contrastive loss would falsify the transfer claim.

Figures

Figures reproduced from arXiv: 1906.10198 by Chao Wang, Gustavo Aguilar, Viktor Rozgi\'c, Weiran Wang.

**Figure 1.** Figure 1: The multimodal model. The shadowed box incloses the acoustic word mechanism, whose output is fed into the GMU unit along with the lexical word representation at each timestep. The model can have N layers of BLSTM at the frame and word levels. where the frame features are used to generate the acoustic word representation. The high level of the model is where the word representations from each modality are c… view at source ↗

**Figure 2.** Figure 2: The multi-view models. The view on the left is the acoustic model, and the view on the right is the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Multimodal Attention. The figure shows the attention mechanisms at the modality and utterance levels. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Histogram of frame sequences for every ut [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Histogram of number of words per utterance. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Histogram of word lengths in terms of frames. B Experimental Settings We train all our models for 30 epochs using a learning rate of 1e-4 and a batch size of 64. The optimization of the models is conducted using Adam (Kingma and Ba, 2014). We consistently use gradient clipping among our experiments. We clip the norm of the gradient beyond 5 (Pascanu et al., 2012; Goodfellow et al., 2016): g ← gτ ||g|| if |… view at source ↗

**Figure 7.** Figure 7: Correct predictions (italics) of the model along with the attention visualization. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Incorrect predictions (italics) of the model along with the attention visualization. [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Confusion matrix of the acoustic model B [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

read the original abstract

Studies on emotion recognition (ER) show that combining lexical and acoustic information results in more robust and accurate models. The majority of the studies focus on settings where both modalities are available in training and evaluation. However, in practice, this is not always the case; getting ASR output may represent a bottleneck in a deployment pipeline due to computational complexity or privacy-related constraints. To address this challenge, we study the problem of efficiently combining acoustic and lexical modalities during training while still providing a deployable acoustic model that does not require lexical inputs. We first experiment with multimodal models and two attention mechanisms to assess the extent of the benefits that lexical information can provide. Then, we frame the task as a multi-view learning problem to induce semantic information from a multimodal model into our acoustic-only network using a contrastive loss function. Our multimodal model outperforms the previous state of the art on the USC-IEMOCAP dataset reported on lexical and acoustic information. Additionally, our multi-view-trained acoustic network significantly surpasses models that have been exclusively trained with acoustic features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper uses contrastive multi-view learning to train an acoustic-only ER model from multimodal data, claiming SOTA on IEMOCAP, but the abstract gives no numbers or controls so the gains cannot be checked.

read the letter

The main thing to know is that this work trains a multimodal model on lexical plus acoustic features for emotion recognition, then applies a contrastive loss to transfer useful representations into an acoustic-only student network. The goal is to keep the benefits of multimodal training while allowing deployment without lexical input or ASR at test time. They also test attention mechanisms in the multimodal stage. The abstract states the multimodal version beats prior reported results on IEMOCAP and the acoustic student beats models trained only on acoustics. That framing of the deployment constraint is the clearest contribution. The approach itself is a direct application of existing multi-view contrastive ideas rather than a new theoretical step. It does address a practical issue that comes up when lexical features are costly or restricted. The paper is short on specifics. No performance numbers appear in the abstract, no baseline descriptions, no mention of data splits, significance tests, or hyperparameter choices. Without those, the outperformance claims stay unverified. The transfer story also rests on the assumption that the contrastive loss actually moves semantic information into the acoustic model in a way that generalizes, but nothing in the provided text shows ablations or controls that would confirm this. The work is aimed at people already working on multimodal emotion recognition who need acoustic-only inference options. A reader in that niche might pick up the method and try it, but would have to re-implement and test to know if the gains hold. The idea is coherent enough on its own terms to warrant a full review rather than a desk reject, mainly so the experimental details can be examined.

Referee Report

2 major / 1 minor

Summary. The paper studies multimodal emotion recognition on the USC-IEMOCAP dataset by combining lexical and acoustic features with attention mechanisms, then frames the problem as multi-view learning to distill semantic information into an acoustic-only student model via contrastive loss. It claims the multimodal model outperforms prior SOTA and that the multi-view acoustic network significantly surpasses acoustic-only baselines.

Significance. If the empirical results hold with proper controls, the work would be significant for practical acoustic-only deployment scenarios where lexical input is unavailable due to cost or privacy constraints, showing that contrastive transfer can induce useful cross-modal information without requiring lexical features at test time.

major comments (2)

Abstract: the central claim that the multimodal model 'outperforms the previous state of the art' and that the multi-view acoustic network 'significantly surpasses' acoustic-only models is presented without any numerical results, listed baselines, data-split protocol, or statistical tests; this absence makes the primary empirical contribution unverifiable and load-bearing for acceptance.
Abstract / Methods (contrastive loss description): the mechanism by which the contrastive loss transfers semantic information from the multimodal teacher to the acoustic student is not formalized; without the exact loss equation, pair construction, or hyperparameters, it is impossible to assess whether the reported gains reduce to the specific IEMOCAP splits or attention choices rather than generalizable transfer.

minor comments (1)

Abstract: the dataset is called 'USC-IEMOCAP'; clarify whether this is the standard IEMOCAP corpus or a modified version and provide the exact citation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and verifiability of the claims.

read point-by-point responses

Referee: [—] Abstract: the central claim that the multimodal model 'outperforms the previous state of the art' and that the multi-view acoustic network 'significantly surpasses' acoustic-only models is presented without any numerical results, listed baselines, data-split protocol, or statistical tests; this absence makes the primary empirical contribution unverifiable and load-bearing for acceptance.

Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will add the key numerical results (accuracy and F1 improvements), explicitly list the baselines, state the IEMOCAP speaker-independent split protocol, and note the statistical significance tests performed. This will make the central claims verifiable directly from the abstract. revision: yes
Referee: [—] Abstract / Methods (contrastive loss description): the mechanism by which the contrastive loss transfers semantic information from the multimodal teacher to the acoustic student is not formalized; without the exact loss equation, pair construction, or hyperparameters, it is impossible to assess whether the reported gains reduce to the specific IEMOCAP splits or attention choices rather than generalizable transfer.

Authors: The full loss equation, positive/negative pair construction (same-utterance multimodal-acoustic pairs as positives), and hyperparameters appear in Section 3.2. To make the transfer mechanism immediately accessible and address the referee's concern, we will insert a concise formalization and reference to the equation into the abstract and/or the opening of the methods section in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical study of multimodal and multi-view models for emotion recognition on the public USC-IEMOCAP dataset. It reports performance comparisons of trained neural networks using attention mechanisms and contrastive loss, with no mathematical derivation chain, equations, or first-principles results claimed. All central claims reduce to direct experimental outcomes on held-out splits rather than any self-referential fitting, renaming, or self-citation load-bearing step. The work is therefore self-contained against external benchmarks with no circularity present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted. Approach implicitly relies on standard assumptions of neural network optimization and the transferability of representations via contrastive loss.

pith-pipeline@v0.9.0 · 5714 in / 1030 out tokens · 24212 ms · 2026-05-25T17:12:15.048407+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 9 internal anchors

[1]

Zakaria Aldeneh, Soheil Khorram, Dimitrios Dimitriadis, and Emily Mower Provost. 2017. https://doi.org/10.1145/3136755.3136760 Pooling acoustic and lexical features for the prediction of valence . In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pages 68--72, New York, NY, USA. ACM

work page doi:10.1145/3136755.3136760 2017
[2]

Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. http://dl.acm.org/citation.cfm?id=3042817.3043076 Deep canonical correlation analysis . In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III--1247--III--1255. JMLR.org

work page arXiv 2013
[3]

John Arevalo, Thamar Solorio, Manuel Montes-y G \'o mez, and Fabio A Gonz \'a lez. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . CoRR, abs/1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2014
[5]

Adela Barbulescu, R \' e mi Ronfard, and G \' e rard Bailly. 2017. https://doi.org/10.1016/j.specom.2017.07.003 Which prosodic features contribute to the recognition of dramatic attitudes? Speech Communication, 95:78--86

work page doi:10.1016/j.specom.2017.07.003 2017
[6]

Avrim Blum and Tom Mitchell. 1998. https://doi.org/10.1145/279943.279962 Combining labeled and unlabeled data with co-training . In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, pages 92--100, New York, NY, USA. ACM

work page doi:10.1145/279943.279962 1998
[7]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335

work page 2008
[8]

Roddy Cowie. 2009. Perceiving emotion: towards a realistic understanding of the task. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1535):3515--3525

work page 2009
[9]

Florian Eyben, Felix Weninger, Florian Gross, and Bj\" o rn Schuller. 2013. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, MM '13, pages 835--838, New York, NY, USA. ACM

work page 2013
[10]

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org

work page 2016
[11]

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. http://www.aclweb.org/anthology/N18-1193 Conversational memory network for emotion recognition in dyadic dialogue videos . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu...

work page 2018
[12]

Wanjia He, Weiran Wang, and Karen Livescu. 2016. http://arxiv.org/abs/1611.04496 Multi-view recurrent neural acoustic word embeddings . CoRR, abs/1611.04496

work page internal anchor Pith review Pith/arXiv arXiv 2016
[13]

Sergey Ioffe and Christian Szegedy. 2015. http://arxiv.org/abs/1502.03167 Batch normalization: Accelerating deep network training by reducing internal covariate shift . CoRR, abs/1502.03167

work page internal anchor Pith review Pith/arXiv arXiv 2015
[14]

Qin Jin, Chengxin Li, Shizhe Chen, and Huimin Wu. 2015. https://doi.org/10.1109/ICASSP.2015.7178872 Speech emotion recognition with acoustic and lexical features . 2015:4749--4753

work page doi:10.1109/icassp.2015.7178872 2015
[15]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2014. http://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . CoRR, abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[16]

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. 2014. https://www.microsoft.com/en-us/research/publication/learning-small-size-dnn-with-output-distribution-based-criteria/ Learning small-size dnn with output-distribution-based criteria . In Interspeech

work page 2014
[17]

Zheng Lian, Ya Li, Jianhua Tao, and Jian Huang. 2018. https://doi.org/10.1145/3267935.3267946 Speech emotion recognition via contrastive loss under siamese networks . In Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data, ASMMC-MMAC'18, pag...

work page doi:10.1145/3267935.3267946 2018
[18]

N Majumder, D Hazarika, Alexander Gelbukh, Erik Cambria, and Soujanya Poria. 2018. Multimodal sentiment analysis using hierarchical fusion with context modeling

work page 2018
[19]

Zhong Meng, Jinyu Li, Yifan Gong, and Biing-Hwang Juang. 2018. https://doi.org/10.1109/ICASSP.2018.8461682 Adversarial teacher-student learning for unsupervised domain adaptation . pages 5949--5953

work page doi:10.1109/icassp.2018.8461682 2018
[20]

Emily Mower Provost, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. 2009. https://doi.org/10.1109/ACII.2009.5349500 Interpreting ambiguous emotional expressions

work page doi:10.1109/acii.2009.5349500 2009
[21]

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. http://arxiv.org/abs/1211.5063 Understanding the exploding gradient problem . CoRR, abs/1211.5063

work page internal anchor Pith review Pith/arXiv arXiv 2012
[22]

Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. http://www.aclweb.org/anthology/P13-1096 Utterance-level multimodal sentiment analysis . In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 973--982, Sofia, Bulgaria. Association for Computational Linguistics

work page 2013
[23]

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. http://www.aclweb.org/anthology/N18-1202 Deep contextualized word representations . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (...

work page 2018
[24]

Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Amir Hussain, and Alexander F. Gelbukh. 2018. http://arxiv.org/abs/1803.07427 Multimodal sentiment analysis: Addressing key issues and setting up baselines . CoRR, abs/1803.07427

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Viktor Rozgic, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad. 2012. Ensemble of svm trees for multimodal emotion recognition. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1--4. IEEE

work page 2012
[26]

Bj \"o rn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, et al. 2013. The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In Proceedings INTERSPEECH 2013, 14th Annual Conference of the Intern...

work page 2013
[27]

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. http://dl.acm.org/citation.cfm?id=2627435.2670313 Dropout: A simple way to prevent neural networks from overfitting . J. Mach. Learn. Res., 15(1):1929--1958

work page arXiv 2014
[28]

Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. 2015. On deep multi-view representation learning. In International Conference on Machine Learning, pages 1083--1092

work page 2015
[29]

Chang Xu, Dacheng Tao, and Chao Xu. 2013. http://arxiv.org/abs/1304.5634 A survey on multi-view learning . CoRR, abs/1304.5634

work page internal anchor Pith review Pith/arXiv arXiv 2013
[30]

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis - Philippe Morency. 2017. http://arxiv.org/abs/1707.07250 Tensor fusion network for multimodal sentiment analysis . CoRR, abs/1707.07250

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[32]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Zakaria Aldeneh, Soheil Khorram, Dimitrios Dimitriadis, and Emily Mower Provost. 2017. https://doi.org/10.1145/3136755.3136760 Pooling acoustic and lexical features for the prediction of valence . In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pages 68--72, New York, NY, USA. ACM

work page doi:10.1145/3136755.3136760 2017

[2] [2]

Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. http://dl.acm.org/citation.cfm?id=3042817.3043076 Deep canonical correlation analysis . In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III--1247--III--1255. JMLR.org

work page arXiv 2013

[3] [3]

John Arevalo, Thamar Solorio, Manuel Montes-y G \'o mez, and Fabio A Gonz \'a lez. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . CoRR, abs/1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2014

[5] [5]

Adela Barbulescu, R \' e mi Ronfard, and G \' e rard Bailly. 2017. https://doi.org/10.1016/j.specom.2017.07.003 Which prosodic features contribute to the recognition of dramatic attitudes? Speech Communication, 95:78--86

work page doi:10.1016/j.specom.2017.07.003 2017

[6] [6]

Avrim Blum and Tom Mitchell. 1998. https://doi.org/10.1145/279943.279962 Combining labeled and unlabeled data with co-training . In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, pages 92--100, New York, NY, USA. ACM

work page doi:10.1145/279943.279962 1998

[7] [7]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335

work page 2008

[8] [8]

Roddy Cowie. 2009. Perceiving emotion: towards a realistic understanding of the task. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1535):3515--3525

work page 2009

[9] [9]

Florian Eyben, Felix Weninger, Florian Gross, and Bj\" o rn Schuller. 2013. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, MM '13, pages 835--838, New York, NY, USA. ACM

work page 2013

[10] [10]

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org

work page 2016

[11] [11]

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. http://www.aclweb.org/anthology/N18-1193 Conversational memory network for emotion recognition in dyadic dialogue videos . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu...

work page 2018

[12] [12]

Wanjia He, Weiran Wang, and Karen Livescu. 2016. http://arxiv.org/abs/1611.04496 Multi-view recurrent neural acoustic word embeddings . CoRR, abs/1611.04496

work page internal anchor Pith review Pith/arXiv arXiv 2016

[13] [13]

Sergey Ioffe and Christian Szegedy. 2015. http://arxiv.org/abs/1502.03167 Batch normalization: Accelerating deep network training by reducing internal covariate shift . CoRR, abs/1502.03167

work page internal anchor Pith review Pith/arXiv arXiv 2015

[14] [14]

Qin Jin, Chengxin Li, Shizhe Chen, and Huimin Wu. 2015. https://doi.org/10.1109/ICASSP.2015.7178872 Speech emotion recognition with acoustic and lexical features . 2015:4749--4753

work page doi:10.1109/icassp.2015.7178872 2015

[15] [15]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2014. http://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . CoRR, abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014

[16] [16]

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. 2014. https://www.microsoft.com/en-us/research/publication/learning-small-size-dnn-with-output-distribution-based-criteria/ Learning small-size dnn with output-distribution-based criteria . In Interspeech

work page 2014

[17] [17]

Zheng Lian, Ya Li, Jianhua Tao, and Jian Huang. 2018. https://doi.org/10.1145/3267935.3267946 Speech emotion recognition via contrastive loss under siamese networks . In Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data, ASMMC-MMAC'18, pag...

work page doi:10.1145/3267935.3267946 2018

[18] [18]

N Majumder, D Hazarika, Alexander Gelbukh, Erik Cambria, and Soujanya Poria. 2018. Multimodal sentiment analysis using hierarchical fusion with context modeling

work page 2018

[19] [19]

Zhong Meng, Jinyu Li, Yifan Gong, and Biing-Hwang Juang. 2018. https://doi.org/10.1109/ICASSP.2018.8461682 Adversarial teacher-student learning for unsupervised domain adaptation . pages 5949--5953

work page doi:10.1109/icassp.2018.8461682 2018

[20] [20]

Emily Mower Provost, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. 2009. https://doi.org/10.1109/ACII.2009.5349500 Interpreting ambiguous emotional expressions

work page doi:10.1109/acii.2009.5349500 2009

[21] [21]

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. http://arxiv.org/abs/1211.5063 Understanding the exploding gradient problem . CoRR, abs/1211.5063

work page internal anchor Pith review Pith/arXiv arXiv 2012

[22] [22]

Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. http://www.aclweb.org/anthology/P13-1096 Utterance-level multimodal sentiment analysis . In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 973--982, Sofia, Bulgaria. Association for Computational Linguistics

work page 2013

[23] [23]

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. http://www.aclweb.org/anthology/N18-1202 Deep contextualized word representations . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (...

work page 2018

[24] [24]

Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Amir Hussain, and Alexander F. Gelbukh. 2018. http://arxiv.org/abs/1803.07427 Multimodal sentiment analysis: Addressing key issues and setting up baselines . CoRR, abs/1803.07427

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

Viktor Rozgic, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad. 2012. Ensemble of svm trees for multimodal emotion recognition. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1--4. IEEE

work page 2012

[26] [26]

Bj \"o rn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, et al. 2013. The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In Proceedings INTERSPEECH 2013, 14th Annual Conference of the Intern...

work page 2013

[27] [27]

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. http://dl.acm.org/citation.cfm?id=2627435.2670313 Dropout: A simple way to prevent neural networks from overfitting . J. Mach. Learn. Res., 15(1):1929--1958

work page arXiv 2014

[28] [28]

Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. 2015. On deep multi-view representation learning. In International Conference on Machine Learning, pages 1083--1092

work page 2015

[29] [29]

Chang Xu, Dacheng Tao, and Chao Xu. 2013. http://arxiv.org/abs/1304.5634 A survey on multi-view learning . CoRR, abs/1304.5634

work page internal anchor Pith review Pith/arXiv arXiv 2013

[30] [30]

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis - Philippe Morency. 2017. http://arxiv.org/abs/1707.07250 Tensor fusion network for multimodal sentiment analysis . CoRR, abs/1707.07250

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[32] [32]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page