
arXiv: 1907.04197 · v1 · pith:CAZ52ZL5new · submitted 2019-07-08 · 💻 cs.LG · cs.CL · stat.ML

Attending to Emotional Narratives

Pith reviewed 2026-05-25 01:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords attention mechanisms · emotion recognition · Transformer · Memory Fusion Network · multimodal time-series · autobiographical narratives · emotional valence · affective computing

The pith

Attention mechanisms like the Transformer generalize to multimodal emotion recognition on autobiographical narratives, sometimes matching human raters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention-based mechanisms originally developed for sequence prediction tasks, specifically the Transformer's parallelizable self-attention and the Memory Fusion Network's cross-modal and temporal attention, also apply effectively to predicting emotional valence from multimodal time-series data. It demonstrates this by adapting the models to a dataset of emotional autobiographical narratives. A sympathetic reader would care because successful generalization would mean these architectures can capture how emotions unfold over time across different input channels without needing entirely new designs. This points toward more flexible tools for tracking emotional states in narrative data.

Core claim

The central claim is that the Transformer with its parallelizable self-attention layers and the Memory Fusion Network with attention across modalities and time generalize well to multimodal time-series emotion recognition. Using a recently-introduced dataset of emotional autobiographical narratives, the models predict emotional valence over time and in some cases reach performance comparable with human raters. The paper ends by discussing the implications of attention mechanisms for affective computing.

What carries the argument

The Transformer model using parallelizable self-attention layers for sequence processing and the Memory Fusion Network using attention across modalities and time for multimodal fusion, both applied to valence prediction.
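The self-attention step named above can be illustrated with a minimal sketch of single-head scaled dot-product self-attention, as defined in reference [20]. This is not the authors' implementation: the function names (`softmax`, `matmul`, `self_attention`), the toy input, and the identity projection matrices are all assumptions chosen for illustration, and the code uses plain Python lists so that it runs without dependencies.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    x: T x d input, one row per time step.
    w_q, w_k, w_v: d x d_k projection matrices (hypothetical values here).
    Returns (output, weights): T x d_k outputs and T x T attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k, shape d_k x T
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Each time step attends to every time step; rows sum to 1.
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Toy sequence of three time steps with two features each (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(x, identity, identity, identity)
```

Because every row of `attn` is computed independently of the others, all time steps can be processed at once, which is the sense in which the layer is parallelizable.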

If this is right

  • The attention mechanisms achieve high performance when predicting emotional valence over time in the narratives.
  • In some cases the models reach performance levels comparable to human raters.
  • The same architectures apply across sequence prediction and emotion recognition without major redesign.
  • Attention mechanisms carry implications for building systems in affective computing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results raise the question of whether the same attention layers would transfer to other time-series emotion tasks such as speech or video without further changes.
  • One could examine whether the parallel self-attention specifically helps capture long-range emotional arcs in longer narratives.
  • The approach might extend to real-time applications where emotional valence must be tracked from ongoing multimodal input streams.

Load-bearing premise

The models can be adapted to the multimodal time-series emotion recognition task on the autobiographical narratives dataset in a way that allows valid comparison to human raters.

What would settle it

The claim would be undermined if, after retraining the Transformer and Memory Fusion Network on the emotional autobiographical narratives dataset and measuring their output against human valence ratings, the models failed to reach agreement levels comparable to those of human raters.

Figures

Figures reproduced from arXiv: 1907.04197 by Desmond C. Ong, Jamil Zaki, Tan Zhi-Xuan, Xiyu Zhang, Zhengxuan Wu.

Figure 1: Diagrammatic overview of our two modelling approaches. (a) Simple Fusion Transformer. (b) Memory Fusion Transformer.
Figure 2: Schematic of the basic Transformer architecture [20] we employed.
Figure 3: Sample of the best-performing non-unimodal model predictions (SFT:
Figure 4: Out-of-sample emotion valence prediction of the speech
Original abstract

Attention mechanisms in deep neural networks have achieved excellent performance on sequence-prediction tasks. Here, we show that these recently-proposed attention-based mechanisms (in particular, the Transformer with its parallelizable self-attention layers, and the Memory Fusion Network with attention across modalities and time) also generalize well to multimodal time-series emotion recognition. Using a recently-introduced dataset of emotional autobiographical narratives, we adapt and apply these two attention mechanisms to predict emotional valence over time. Our models perform extremely well, in some cases reaching a performance comparable with human raters. We end with a discussion of the implications of attention mechanisms to affective computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript adapts two attention-based architectures—the Transformer (with self-attention) and the Memory Fusion Network (with cross-modal and temporal attention)—to the task of predicting continuous emotional valence from multimodal time-series data in a dataset of autobiographical narratives. The authors report that both models achieve strong performance, in some cases approaching human-rater levels, and discuss implications for affective computing.

Significance. If the reported performance numbers and human comparisons are reproducible, the work supplies concrete evidence that recently developed attention mechanisms transfer to multimodal affective time-series tasks without task-specific redesign, which is a useful data point for the affective-computing community.

minor comments (3)
  1. §3.2 and Table 2: the description of the MFN temporal attention module should explicitly state the dimensionality of the memory bank and the number of attention heads used in the experiments; these hyperparameters are referenced in the results but not defined in the model section.
  2. Figure 3: the y-axis label on the valence-prediction plots should indicate the exact range and scaling (e.g., [-1,1] or [0,1]) to allow direct comparison with the human-rater baselines shown in the same figure.
  3. §4.3: the sentence claiming “performance comparable with human raters” should be qualified by the specific metric (CCC or MSE) and the exact human inter-rater value; the current phrasing is slightly stronger than the numbers in Table 3 support.
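For readers unfamiliar with the CCC metric named in comment 3, the following is a minimal sketch of Lin's concordance correlation coefficient (reference [39]), using population (1/n) moments. It is not the paper's evaluation code: the function name `concordance_cc` and the example valence values are hypothetical and chosen only to show how the metric behaves.

```python
from statistics import fmean


def concordance_cc(pred, target):
    """Lin's concordance correlation coefficient between two equal-length series.

    CCC = 2 * cov / (var_pred + var_target + (mean_pred - mean_target) ** 2)
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(pred)
    mean_p, mean_t = fmean(pred), fmean(target)
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom


# Made-up valence ratings, for illustration only.
ratings = [0.1, 0.4, -0.2, 0.3]
shifted = [r + 0.5 for r in ratings]   # same shape, constant offset
flipped = [-r for r in ratings]        # opposite direction

same_score = concordance_cc(ratings, ratings)
shifted_score = concordance_cc(ratings, shifted)
flipped_score = concordance_cc(ratings, flipped)
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and ratings, so `shifted_score` falls below `same_score` even though the two series are perfectly linearly related. That difference is one reason the metric used for any comparison with human raters should be stated explicitly.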

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear in the report, so we have no individual points to address.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The manuscript applies existing attention models (Transformer, Memory Fusion Network) to an emotion-recognition task on a new dataset and reports empirical results. No equations, derivations, or fitted quantities are introduced that could reduce to inputs by construction. Claims rest on standard model adaptation and performance metrics rather than any self-definitional, self-citation load-bearing, or renaming steps. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identifiable. The work relies on standard attention mechanisms from prior literature without introducing new ones.

pith-pipeline@v0.9.0 · 5638 in / 1124 out tokens · 30503 ms · 2026-05-25T01:33:18.165585+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

  [1] S. D. Preston and F. B. De Waal, “Empathy: Its ultimate and proximate bases,” Behavioral and Brain Sciences, vol. 25, no. 1, pp. 1–20, 2002

  [2] S. A. Morelli, D. C. Ong, R. Makati, M. O. Jackson, and J. Zaki, “Empathy and well-being correlate with centrality in different social networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 37, pp. 9843–9847, 2017

  [3] D. C. Ong, J. Zaki, and N. D. Goodman, “Affective cognition: Exploring lay theories of emotion,” Cognition, vol. 143, pp. 141–162, 2015

  [4] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009

  [5] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017

  [6] G. Levi and T. Hassner, “Emotion recognition in the wild via convolutional neural networks and mapped binary patterns,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 503–510

  [7] C. Dos Santos and M. Gatti, “Deep convolutional neural networks for sentiment analysis of short texts,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78

  [8] M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing, vol. 2, no. 2, pp. 92–105, 2011

  [9] D. C. Ong, Z. Wu, T. Zhi-Xuan, M. Reddan, I. Kahhale, A. Mattek, and J. Zaki, “Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset,” Invited Revision to Journal

  [10] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989

  [11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

  [12] S. E. Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, “Recurrent neural networks for emotion recognition in video,” in Proceedings of the ACM International Conference on Multimodal Interaction, 2015, pp. 467–474

  [13] P. Khorrami, T. Le Paine, K. Brady, C. Dagli, and T. S. Huang, “How deep neural networks can improve emotion recognition on video data,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 619–623

  [14] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, “Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies,” in Proceedings Interspeech, 2008, pp. 597–600

  [15] F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie, “On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues,” Journal on Multimodal User Interfaces, vol. 3, no. 1-2, pp. 7–19, 2010

  [16] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, “LSTM-modeling of continuous emotions in an audiovisual affect recognition framework,” Image and Vision Computing, vol. 31, no. 2, pp. 153–163, 2013

  [17] Z. X. Tan, A. Goel, T.-S. Nguyen, and D. C. Ong, “A multimodal LSTM for predicting listener empathic responses over time,” in OMG-Empathy Challenge workshop at the 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2019

  [18] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 1412–1421

  [19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the 4th International Conference on Learning Representations (ICLR), 2015

  [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008

  [21] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le, “QANet: Combining local convolution with global self-attention for reading comprehension,” arXiv preprint arXiv:1804.09541, 2018

  [22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

  [23] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, “Non-autoregressive neural machine translation,” arXiv preprint arXiv:1711.02281, 2017

  [24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” Technical report, OpenAI, Tech. Rep., 2018

  [25] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2227–2231

  [26] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L.-P. Morency, “Memory fusion network for multi-view sequential learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  [27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

  [28] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

  [29] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 835–838

  [30] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543

  [31] H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in IEEE International Conference on Systems, Man and Cybernetics, vol. 4, 2005, pp. 3437–3443

  [32] C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late fusion in semantic video analysis,” in Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, pp. 399–402

  [33] L. Chao, J. Tao, M. Yang, Y. Li, and Z. Wen, “Long short term memory recurrent neural network based multimodal dimensional emotion recognition,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 65–72

  [34] S. Chen and Q. Jin, “Multi-modal dimensional emotion recognition using recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 49–56

  [35] K. Brady, Y. Gwon, P. Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 2016, pp. 97–104

  [36] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015

  [37] D. C. Ong, “Computational affective cognition: Modeling reasoning about emotions,” Ph.D. dissertation, Stanford University, 2017

  [38] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, “Primitives-based evaluation and estimation of emotions in speech,” Speech Communication, vol. 49, no. 10-11, pp. 787–800, 2007

  [39] L. I.-K. Lin, “A concordance correlation coefficient to evaluate reproducibility,” Biometrics, pp. 255–268, 1989

  [40] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 3–10

  [41] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “AVEC 2017: Real-life depression, and affect recognition workshop and challenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, 2017, pp. 3–9

  [42] D. C. Ong, J. Zaki, and N. D. Goodman, “Computational models of emotion inference in theory of mind: A review and roadmap,” Topics in Cognitive Science, vol. 11, no. 2, pp. 338–357, 2019

  [43] J. Zaki, “Cue integration: A common framework for social cognition and physical perception,” Perspectives on Psychological Science, vol. 8, no. 3, pp. 296–312, 2013