Attending to Emotional Narratives

Desmond C. Ong; Jamil Zaki; Tan Zhi-Xuan; Xiyu Zhang; Zhengxuan Wu

arxiv: 1907.04197 · v1 · pith:CAZ52ZL5new · submitted 2019-07-08 · 💻 cs.LG · cs.CL· stat.ML

Attending to Emotional Narratives

Zhengxuan Wu , Xiyu Zhang , Tan Zhi-Xuan , Jamil Zaki , Desmond C. Ong This is my paper

Pith reviewed 2026-05-25 01:33 UTC · model grok-4.3

classification 💻 cs.LG cs.CLstat.ML

keywords attention mechanismsemotion recognitionTransformerMemory Fusion Networkmultimodal time-seriesautobiographical narrativesemotional valenceaffective computing

0 comments

The pith

Attention mechanisms like the Transformer generalize to multimodal emotion recognition on autobiographical narratives, sometimes matching human raters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention-based mechanisms originally developed for sequence prediction tasks, specifically the Transformer's parallelizable self-attention and the Memory Fusion Network's cross-modal and temporal attention, also apply effectively to predicting emotional valence from multimodal time-series data. It demonstrates this by adapting the models to a dataset of emotional autobiographical narratives. A sympathetic reader would care because successful generalization would mean these architectures can capture how emotions unfold over time across different input channels without needing entirely new designs. This points toward more flexible tools for tracking emotional states in narrative data.

Core claim

The central claim is that the Transformer with its parallelizable self-attention layers and the Memory Fusion Network with attention across modalities and time generalize well to multimodal time-series emotion recognition. Using a recently-introduced dataset of emotional autobiographical narratives, the models predict emotional valence over time and in some cases reach performance comparable with human raters. The paper ends by discussing the implications of attention mechanisms for affective computing.

What carries the argument

The Transformer model using parallelizable self-attention layers for sequence processing and the Memory Fusion Network using attention across modalities and time for multimodal fusion, both applied to valence prediction.

If this is right

The attention mechanisms achieve high performance when predicting emotional valence over time in the narratives.
In some cases the models reach performance levels comparable to human raters.
The same architectures apply across sequence prediction and emotion recognition without major redesign.
Attention mechanisms carry implications for building systems in affective computing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results raise the question of whether the same attention layers would transfer to other time-series emotion tasks such as speech or video without further changes.
One could examine whether the parallel self-attention specifically helps capture long-range emotional arcs in longer narratives.
The approach might extend to real-time applications where emotional valence must be tracked from ongoing multimodal input streams.

Load-bearing premise

The models can be adapted to the multimodal time-series emotion recognition task on the autobiographical narratives dataset in a way that allows valid comparison to human raters.

What would settle it

Retraining the Transformer and Memory Fusion Network on the emotional autobiographical narratives dataset and measuring their output against human valence ratings, where the models fail to reach comparable agreement levels.

Figures

Figures reproduced from arXiv: 1907.04197 by Desmond C. Ong, Jamil Zaki, Tan Zhi-Xuan, Xiyu Zhang, Zhengxuan Wu.

**Figure 1.** Figure 1: Diagramatic overview of our two modelling approaches. (a) Simple Fusion Transformer. (b) Memory Fusion Transformer. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Schematic of the basic Transformer architecture [20] we employed. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Sample of the best-performing non-unimodal model predictions (SFT: [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Out-of-sample emotion valence prediction of the speech [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Attention mechanisms in deep neural networks have achieved excellent performance on sequence-prediction tasks. Here, we show that these recently-proposed attention-based mechanisms---in particular, the Transformer with its parallelizable self-attention layers, and the Memory Fusion Network with attention across modalities and time---also generalize well to multimodal time-series emotion recognition. Using a recently-introduced dataset of emotional autobiographical narratives, we adapt and apply these two attention mechanisms to predict emotional valence over time. Our models perform extremely well, in some cases reaching a performance comparable with human raters. We end with a discussion of the implications of attention mechanisms to affective computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adapts Transformer and Memory Fusion Network to multimodal narrative emotion prediction and gets credible results, but stays within an application of existing attention tools.

read the letter

The main point is that Transformer and Memory Fusion Network attention mechanisms can be adapted to predict emotional valence over time from multimodal autobiographical narratives, and the models reach performance levels close to human raters on this dataset. The full paper supplies the architecture changes, training protocol, metrics, and dataset splits that the abstract omitted, so the claims are checkable rather than vague assertions. They handle the time-series and cross-modal aspects with self-attention and fusion attention in ways that make sense for the task. The human rater comparison is included and helps ground the numbers. This is a straightforward extension of recent attention work into affective computing on narrative data, and the experiments are described with enough concreteness to evaluate the adaptation. The results support the generalization claim within the scope of this dataset. The softer spots are that the contribution is mostly application-level rather than a new mechanism or theoretical insight about attention. The autobiographical narratives dataset has its own structure, so how far the findings extend to other emotion recognition settings is not tested here. Human-comparable performance also depends on the exact rating instructions and agreement levels, which are reported but worth weighing carefully. This paper is for researchers in affective computing or multimodal time-series modeling who want to see attention models tried on story-based emotion data and need implementation details. A reader working on emotion-aware systems would get practical value from the benchmarks and adaptations. It deserves peer review because the experiments are grounded and the central performance claim can be assessed directly from the provided details.

Referee Report

0 major / 3 minor

Summary. The manuscript adapts two attention-based architectures—the Transformer (with self-attention) and the Memory Fusion Network (with cross-modal and temporal attention)—to the task of predicting continuous emotional valence from multimodal time-series data in a dataset of autobiographical narratives. The authors report that both models achieve strong performance, in some cases approaching human-rater levels, and discuss implications for affective computing.

Significance. If the reported performance numbers and human comparisons are reproducible, the work supplies concrete evidence that recently developed attention mechanisms transfer to multimodal affective time-series tasks without task-specific redesign, which is a useful data point for the affective-computing community.

minor comments (3)

§3.2 and Table 2: the description of the MFN temporal attention module should explicitly state the dimensionality of the memory bank and the number of attention heads used in the experiments; these hyperparameters are referenced in the results but not defined in the model section.
Figure 3: the y-axis label on the valence-prediction plots should indicate the exact range and scaling (e.g., [-1,1] or [0,1]) to allow direct comparison with the human-rater baselines shown in the same figure.
§4.3: the sentence claiming “performance comparable with human raters” should be qualified by the specific metric (CCC or MSE) and the exact human inter-rater value; the current phrasing is slightly stronger than the numbers in Table 3 support.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear in the report, so we have no individual points to address.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The manuscript applies existing attention models (Transformer, Memory Fusion Network) to an emotion-recognition task on a new dataset and reports empirical results. No equations, derivations, or fitted quantities are introduced that could reduce to inputs by construction. Claims rest on standard model adaptation and performance metrics rather than any self-definitional, self-citation load-bearing, or renaming steps. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identifiable. The work relies on standard attention mechanisms from prior literature without introducing new ones.

pith-pipeline@v0.9.0 · 5638 in / 1124 out tokens · 30503 ms · 2026-05-25T01:33:18.165585+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

[1]

Empathy: Its ultimate and proximate bases,

S. D. Preston and F. B. De Waal, “Empathy: Its ultimate and proximate bases,” Behavioral and Brain Sciences , vol. 25, no. 1, pp. 1–20, 2002

work page 2002
[2]

Empathy and well-being correlate with centrality in different social networks,

S. A. Morelli, D. C. Ong, R. Makati, M. O. Jackson, and J. Zaki, “Empathy and well-being correlate with centrality in different social networks,” Proceedings of the National Academy of Sciences , vol. 114, no. 37, pp. 9843–9847, 2017

work page 2017
[3]

Affective cognition: Exploring lay theories of emotion,

D. C. Ong, J. Zaki, and N. D. Goodman, “Affective cognition: Exploring lay theories of emotion,” Cognition, vol. 143, pp. 141–162, 2015

work page 2015
[4]

A survey of affect recognition methods: Audio, visual, and spontaneous expressions,

Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 31, no. 1, pp. 39–58, 2009

work page 2009
[5]

A review of affective computing: From unimodal analysis to multimodal fusion,

S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017

work page 2017
[6]

Emotion recognition in the wild via convo- lutional neural networks and mapped binary patterns,

G. Levi and T. Hassner, “Emotion recognition in the wild via convo- lutional neural networks and mapped binary patterns,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction . ACM, 2015, pp. 503–510

work page 2015
[7]

Deep convolutional neural networks for sentiment analysis of short texts,

C. Dos Santos and M. Gatti, “Deep convolutional neural networks for sentiment analysis of short texts,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78

work page 2014
[8]

Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,

M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing , vol. 2, no. 2, pp. 92–105, 2011

work page 2011
[9]

Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset,

D. C. Ong, Z. Wu, T. Zhi-Xuan, M. Reddan, I. Kahhale, A. Mattek, and J. Zaki, “Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset,” Invited Revision to Journal

work page
[10]

A learning algorithm for continually running fully recurrent neural networks,

R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation , vol. 1, no. 2, pp. 270–280, 1989

work page 1989
[11]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997
[12]

Recur- rent neural networks for emotion recognition in video,

S. E. Kahou, V . Michalski, K. Konda, R. Memisevic, and C. Pal, “Recur- rent neural networks for emotion recognition in video,” in Proceedings of the ACM International Conference on Multimodal Interaction , 2015, pp. 467–474

work page 2015
[13]

How deep neural networks can improve emotion recognition on video data,

P. Khorrami, T. Le Paine, K. Brady, C. Dagli, and T. S. Huang, “How deep neural networks can improve emotion recognition on video data,” in 2016 IEEE International Conference on Image Processing (ICIP) . IEEE, 2016, pp. 619–623

work page 2016
[14]

Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies,

M. W ¨ollmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas- Cowie, and R. Cowie, “Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies,” in Proceedings Interspeech, 2008, pp. 597–600

work page 2008
[15]

On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues,

F. Eyben, M. W ¨ollmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie, “On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues,” Journal on Multimodal User Interfaces, vol. 3, no. 1-2, pp. 7–19, 2010

work page 2010
[16]

Lstm- modeling of continuous emotions in an audiovisual affect recognition framework,

M. W ¨ollmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, “Lstm- modeling of continuous emotions in an audiovisual affect recognition framework,” Image and Vision Computing , vol. 31, no. 2, pp. 153–163, 2013

work page 2013
[17]

A multimodal lstm for predicting listener empathic responses over time,

Z. X. Tan, A. Goel, T.-S. Nguyen, and D. C. Ong, “A multimodal lstm for predicting listener empathic responses over time,” in OMG- Empathy Challenge workshop at the 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG) , 2019

work page 2019
[18]

Effective approaches to attention-based neural machine translation,

T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Empirical Methods in Natural Language Processing (EMNLP) , 2015, pp. 1412–1421

work page 2015
[19]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y . Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the 4th International Conference on Learning Representations (ICLR) , 2015

work page 2015
[20]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems , 2017, pp. 5998–6008

work page 2017
[21]

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V . Le, “Qanet: Combining local convolution with global self-attention for reading comprehension,” arXiv preprint arXiv:1804.09541 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Non-Autoregressive Neural Machine Translation

J. Gu, J. Bradbury, C. Xiong, V . O. Li, and R. Socher, “Non-autoregressive neural machine translation,” arXiv preprint arXiv:1711.02281, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” Technical re- port, OpenAi, Tech. Rep., 2018

work page 2018
[25]

Automatic speech emotion recognition using recurrent neural networks with local attention,

S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2227–2231

work page 2017
[26]

Memory fusion network for multi-view sequential learning,

A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L.-P. Morency, “Memory fusion network for multi-view sequential learning,” in Thirty-Second AAAI Conference on Artiﬁcial Intelligence , 2018

work page 2018
[27]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[28]

The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y . Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

work page 2016
[29]

Recent developments in openSMILE, the Munich open-source multimedia feature extractor,

F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multime- dia, 2013, pp. 835–838

work page 2013
[30]

Glove: Global vectors for word representation,

J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543

work page 2014
[31]

Affect recognition from face and body: early fusion vs. late fusion,

H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in IEEE International Conference on Systems, Man and Cybernetics , vol. 4, 2005, pp. 3437–3443

work page 2005
[32]

Early versus late fusion in semantic video analysis,

C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late fusion in semantic video analysis,” in Proceedings of the 13th Annual ACM International Conference on Multimedia , 2005, pp. 399–402

work page 2005
[33]

Long short term memory recurrent neural network based multimodal dimensional emo- tion recognition,

L. Chao, J. Tao, M. Yang, Y . Li, and Z. Wen, “Long short term memory recurrent neural network based multimodal dimensional emo- tion recognition,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 65–72

work page 2015
[34]

Multi-modal dimensional emotion recognition using recurrent neural networks,

S. Chen and Q. Jin, “Multi-modal dimensional emotion recognition using recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 49–56

work page 2015
[35]

Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,

K. Brady, Y . Gwon, P. Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge , 2016, pp. 97–104

work page 2016
[36]

Highway Networks

R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[37]

Computational affective cognition: Modeling reasoning about emotions,

D. C. Ong, “Computational affective cognition: Modeling reasoning about emotions,” Ph.D. dissertation, Stanford University, 2017

work page 2017
[38]

Primitives-based evaluation and estimation of emotions in speech,

M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, “Primitives-based evaluation and estimation of emotions in speech,” Speech Communica- tion, vol. 49, no. 10-11, pp. 787–800, 2007

work page 2007
[39]

A concordance correlation coefﬁcient to evaluate repro- ducibility,

L. I.-K. Lin, “A concordance correlation coefﬁcient to evaluate repro- ducibility,” Biometrics, pp. 255–268, 1989

work page 1989
[40]

Avec 2016: Depression, mood, and emotion recognition workshop and challenge,

M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor- res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “Avec 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 3–10

work page 2016
[41]

A VEC 2017: Real-life depression, and affect recognition workshop and challenge,

F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “A VEC 2017: Real-life depression, and affect recognition workshop and challenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, 2017, pp. 3–9

work page 2017
[42]

Computational models of emotion inference in theory of mind: A review and roadmap,

D. C. Ong, J. Zaki, and N. D. Goodman, “Computational models of emotion inference in theory of mind: A review and roadmap,” Topics in Cognitive Science, vol. 11, no. 2, pp. 338–357, 2019

work page 2019
[43]

Cue integration: A common framework for social cognition and physical perception,

J. Zaki, “Cue integration: A common framework for social cognition and physical perception,” Perspectives on Psychological Science, vol. 8, no. 3, pp. 296–312, 2013

work page 2013

[1] [1]

Empathy: Its ultimate and proximate bases,

S. D. Preston and F. B. De Waal, “Empathy: Its ultimate and proximate bases,” Behavioral and Brain Sciences , vol. 25, no. 1, pp. 1–20, 2002

work page 2002

[2] [2]

Empathy and well-being correlate with centrality in different social networks,

S. A. Morelli, D. C. Ong, R. Makati, M. O. Jackson, and J. Zaki, “Empathy and well-being correlate with centrality in different social networks,” Proceedings of the National Academy of Sciences , vol. 114, no. 37, pp. 9843–9847, 2017

work page 2017

[3] [3]

Affective cognition: Exploring lay theories of emotion,

D. C. Ong, J. Zaki, and N. D. Goodman, “Affective cognition: Exploring lay theories of emotion,” Cognition, vol. 143, pp. 141–162, 2015

work page 2015

[4] [4]

A survey of affect recognition methods: Audio, visual, and spontaneous expressions,

Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 31, no. 1, pp. 39–58, 2009

work page 2009

[5] [5]

A review of affective computing: From unimodal analysis to multimodal fusion,

S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017

work page 2017

[6] [6]

Emotion recognition in the wild via convo- lutional neural networks and mapped binary patterns,

G. Levi and T. Hassner, “Emotion recognition in the wild via convo- lutional neural networks and mapped binary patterns,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction . ACM, 2015, pp. 503–510

work page 2015

[7] [7]

Deep convolutional neural networks for sentiment analysis of short texts,

C. Dos Santos and M. Gatti, “Deep convolutional neural networks for sentiment analysis of short texts,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78

work page 2014

[8] [8]

Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,

M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing , vol. 2, no. 2, pp. 92–105, 2011

work page 2011

[9] [9]

Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset,

D. C. Ong, Z. Wu, T. Zhi-Xuan, M. Reddan, I. Kahhale, A. Mattek, and J. Zaki, “Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset,” Invited Revision to Journal

work page

[10] [10]

A learning algorithm for continually running fully recurrent neural networks,

R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation , vol. 1, no. 2, pp. 270–280, 1989

work page 1989

[11] [11]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997

[12] [12]

Recur- rent neural networks for emotion recognition in video,

S. E. Kahou, V . Michalski, K. Konda, R. Memisevic, and C. Pal, “Recur- rent neural networks for emotion recognition in video,” in Proceedings of the ACM International Conference on Multimodal Interaction , 2015, pp. 467–474

work page 2015

[13] [13]

How deep neural networks can improve emotion recognition on video data,

P. Khorrami, T. Le Paine, K. Brady, C. Dagli, and T. S. Huang, “How deep neural networks can improve emotion recognition on video data,” in 2016 IEEE International Conference on Image Processing (ICIP) . IEEE, 2016, pp. 619–623

work page 2016

[14] [14]

Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies,

M. W ¨ollmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas- Cowie, and R. Cowie, “Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies,” in Proceedings Interspeech, 2008, pp. 597–600

work page 2008

[15] [15]

On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues,

F. Eyben, M. W ¨ollmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie, “On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues,” Journal on Multimodal User Interfaces, vol. 3, no. 1-2, pp. 7–19, 2010

work page 2010

[16] [16]

Lstm- modeling of continuous emotions in an audiovisual affect recognition framework,

M. W ¨ollmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, “Lstm- modeling of continuous emotions in an audiovisual affect recognition framework,” Image and Vision Computing , vol. 31, no. 2, pp. 153–163, 2013

work page 2013

[17] [17]

A multimodal lstm for predicting listener empathic responses over time,

Z. X. Tan, A. Goel, T.-S. Nguyen, and D. C. Ong, “A multimodal lstm for predicting listener empathic responses over time,” in OMG- Empathy Challenge workshop at the 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG) , 2019

work page 2019

[18] [18]

Effective approaches to attention-based neural machine translation,

T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Empirical Methods in Natural Language Processing (EMNLP) , 2015, pp. 1412–1421

work page 2015

[19] [19]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y . Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the 4th International Conference on Learning Representations (ICLR) , 2015

work page 2015

[20] [20]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems , 2017, pp. 5998–6008

work page 2017

[21] [21]

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V . Le, “Qanet: Combining local convolution with global self-attention for reading comprehension,” arXiv preprint arXiv:1804.09541 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[22] [22]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

Non-Autoregressive Neural Machine Translation

J. Gu, J. Bradbury, C. Xiong, V . O. Li, and R. Socher, “Non-autoregressive neural machine translation,” arXiv preprint arXiv:1711.02281, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[24] [24]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” Technical re- port, OpenAi, Tech. Rep., 2018

work page 2018

[25] [25]

Automatic speech emotion recognition using recurrent neural networks with local attention,

S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2227–2231

work page 2017

[26] [26]

Memory fusion network for multi-view sequential learning,

A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L.-P. Morency, “Memory fusion network for multi-view sequential learning,” in Thirty-Second AAAI Conference on Artiﬁcial Intelligence , 2018

work page 2018

[27] [27]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[28] [28]

The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y . Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

work page 2016

[29] [29]

Recent developments in openSMILE, the Munich open-source multimedia feature extractor,

F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multime- dia, 2013, pp. 835–838

work page 2013

[30] [30]

Glove: Global vectors for word representation,

J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543

work page 2014

[31] [31]

Affect recognition from face and body: early fusion vs. late fusion,

H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in IEEE International Conference on Systems, Man and Cybernetics , vol. 4, 2005, pp. 3437–3443

work page 2005

[32] [32]

Early versus late fusion in semantic video analysis,

C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late fusion in semantic video analysis,” in Proceedings of the 13th Annual ACM International Conference on Multimedia , 2005, pp. 399–402

work page 2005

[33] [33]

Long short term memory recurrent neural network based multimodal dimensional emo- tion recognition,

L. Chao, J. Tao, M. Yang, Y . Li, and Z. Wen, “Long short term memory recurrent neural network based multimodal dimensional emo- tion recognition,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 65–72

work page 2015

[34] [34]

Multi-modal dimensional emotion recognition using recurrent neural networks,

S. Chen and Q. Jin, “Multi-modal dimensional emotion recognition using recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 49–56

work page 2015

[35] [35]

Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,

K. Brady, Y . Gwon, P. Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge , 2016, pp. 97–104

work page 2016

[36] [36]

Highway Networks

R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[37] [37]

Computational affective cognition: Modeling reasoning about emotions,

D. C. Ong, “Computational affective cognition: Modeling reasoning about emotions,” Ph.D. dissertation, Stanford University, 2017

work page 2017

[38] [38]

Primitives-based evaluation and estimation of emotions in speech,

M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, “Primitives-based evaluation and estimation of emotions in speech,” Speech Communica- tion, vol. 49, no. 10-11, pp. 787–800, 2007

work page 2007

[39] [39]

A concordance correlation coefﬁcient to evaluate repro- ducibility,

L. I.-K. Lin, “A concordance correlation coefﬁcient to evaluate repro- ducibility,” Biometrics, pp. 255–268, 1989

work page 1989

[40] [40]

Avec 2016: Depression, mood, and emotion recognition workshop and challenge,

M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor- res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “Avec 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 3–10

work page 2016

[41] [41]

A VEC 2017: Real-life depression, and affect recognition workshop and challenge,

F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “A VEC 2017: Real-life depression, and affect recognition workshop and challenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, 2017, pp. 3–9

work page 2017

[42] [42]

Computational models of emotion inference in theory of mind: A review and roadmap,

D. C. Ong, J. Zaki, and N. D. Goodman, “Computational models of emotion inference in theory of mind: A review and roadmap,” Topics in Cognitive Science, vol. 11, no. 2, pp. 338–357, 2019

work page 2019

[43] [43]

Cue integration: A common framework for social cognition and physical perception,

J. Zaki, “Cue integration: A common framework for social cognition and physical perception,” Perspectives on Psychological Science, vol. 8, no. 3, pp. 296–312, 2013

work page 2013