
arxiv: 1907.03050 · v2 · pith:ZVHR2SRLnew · submitted 2019-07-05 · 💻 cs.LG · cs.HC · eess.AS · stat.ML

Jointly Aligning and Predicting Continuous Emotion Annotations

Pith reviewed 2026-05-25 01:55 UTC · model grok-4.3

keywords: continuous emotion recognition · speech emotion analysis · label alignment · convolutional neural network · sinc filter · RECOLA dataset · SEWA dataset · dimensional emotion prediction

The pith

A convolutional network learns time shifts via delayed sinc layers to jointly align speech signals with delayed continuous emotion labels and predict arousal and valence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the fact that continuous emotion annotations from human raters are delayed relative to the speech input because of reaction times. It introduces a multi-delay sinc network consisting of convolutional layers followed by an aligner that uses multiple delayed sinc layers to learn and compensate for these time-varying, non-stationary delays in an end-to-end fashion. The network is evaluated on the RECOLA and SEWA datasets and is shown to reach state-of-the-art speech-only performance on dimensional emotion descriptors. A sympathetic reader would care because accurate continuous emotion tracking from speech could support applications in mental health monitoring or human-computer interaction once the annotation delay problem is solved automatically rather than through manual synchronization.

Core claim

The multi-delay sinc network aligns the speech signal with the emotion labels while predicting dimensional descriptors of emotion. Its aligner network is a stack of delayed sinc layers, each a time-shifted low-pass sinc filter that learns a single delay through gradient descent; used together, multiple such layers are meant to handle non-stationary delays that depend on the acoustic space.

What carries the argument

The delayed sinc layer: a convolutional layer that implements a time-shifted low-pass sinc filter and uses gradient-based optimization to learn one delay value per layer; stacking them compensates for variable reaction-time delays between speech and labels.
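
To make the idea of a time-shifted low-pass sinc filter concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the delay is a fixed argument instead of a parameter learned by gradient descent, and the function names, the Hamming window, the cutoff value, and the kernel width are all assumptions chosen for illustration.

```python
import numpy as np

def delayed_sinc_kernel(delay, cutoff, half_width):
    """Windowed low-pass sinc kernel whose peak sits `delay` samples off center.

    delay      : shift in samples (may be fractional); hypothetical parameter name
    cutoff     : normalized cutoff in cycles per sample, 0 < cutoff <= 0.5
    half_width : the kernel covers taps -half_width .. +half_width
    """
    n = np.arange(-half_width, half_width + 1)
    # np.sinc(x) = sin(pi x) / (pi x): ideal low-pass response, shifted by `delay`
    h = 2.0 * cutoff * np.sinc(2.0 * cutoff * (n - delay))
    # Taper the truncated kernel (window choice is an assumption)
    h = h * np.hamming(len(n))
    # Normalize to unit gain at DC
    return h / h.sum()

def apply_delay(x, delay, cutoff=0.1, half_width=64):
    """Low-pass filter x and delay it by `delay` samples.

    Assumes len(x) is larger than the kernel; edges are implicitly zero-padded.
    """
    h = delayed_sinc_kernel(delay, cutoff, half_width)
    return np.convolve(x, h, mode="same")

if __name__ == "__main__":
    t = np.arange(400)
    x = np.sin(2 * np.pi * 0.01 * t)   # slow sinusoid, well inside the passband
    y = apply_delay(x, delay=5.0)
    # Away from the edges, y should track x shifted right by 5 samples
    print(np.max(np.abs(y[100:300] - x[95:295])))
```

Because the kernel is a smooth function of `delay`, the same expression can in principle be differentiated with respect to the delay, which is what makes a gradient-learned shift plausible.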

If this is right

  • The model reaches state-of-the-art speech-only results on both RECOLA and SEWA by learning time-varying delays.
  • End-to-end training becomes possible without a separate alignment preprocessing step.
  • Non-stationary delays that vary with acoustic content can be compensated inside the network.
  • Dimensional emotion descriptors can be predicted directly from aligned speech features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same delayed-sinc alignment mechanism could be tested on other time-series annotation tasks that suffer from human reaction delays, such as continuous affect in video or physiological signals.
  • If the learned delays prove stable across speakers or sessions, the approach might reduce the amount of manual synchronization required when collecting new emotion datasets.
  • Replacing the sinc-based aligner with other differentiable delay mechanisms could be compared directly on the same datasets to isolate the contribution of the low-pass filter property.

Load-bearing premise

The misalignment between the speech signal and the emotion labels can be adequately compensated by a stack of learned time shifts implemented via delayed sinc layers.

What would settle it

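On RECOLA or SEWA, train a version of the network with the aligner removed, or with all delays fixed to zero, and compare it with the full model on arousal and valence prediction. If the ablated version matches or exceeds the full model, the learned delays are not carrying the result; if it falls clearly short, the premise holds.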
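
A comparison of this kind is scored with the concordance correlation coefficient (CCC), the metric named in the figure captions and the referee report. The sketch below is a toy illustration on synthetic data, not the paper's evaluation code: it computes CCC and sweeps a single fixed delay to see which shift best aligns a feature with a lagged label track. The helper names `shift_right` and `best_fixed_delay` are hypothetical.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)

def shift_right(x, d):
    """Delay x by d >= 0 samples, repeating the first value as padding."""
    x = np.asarray(x, dtype=float)
    if d == 0:
        return x.copy()
    return np.concatenate([np.full(d, x[0]), x[:-d]])

def best_fixed_delay(feature, labels, max_delay):
    """Return the delay in 0..max_delay that maximizes CCC, plus all scores."""
    scores = [ccc(shift_right(feature, d), labels) for d in range(max_delay + 1)]
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    t = np.arange(2000)
    feature = np.sin(2 * np.pi * t / 500) + 0.5 * np.sin(2 * np.pi * t / 137)
    labels = shift_right(feature, 40)   # synthetic "annotator lag" of 40 samples
    d, scores = best_fixed_delay(feature, labels, max_delay=100)
    print(d, scores[0], scores[d])
```

In this toy setting the zero-delay score plays the role of the "delays fixed to zero" ablation, and the sweep plays the role of a single learned delay.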

Figures

Figures reproduced from arXiv: 1907.03050 by Emily Mower Provost, Melvin G McInnis, Soheil Khorram.

Figure 1: Applying delay to acoustic features improves mean CCC for both …
Figure 2: A visualization of our multi-delay sinc network with …
Figure 3: Performance reduction caused by applying a windowed-sinc filter …
Figure 4: Distribution of the delays trained through different runs of the MDS network with one cluster. This network learns one delay to compensate …
Figure 5: Increasing bandwidth of the sinc kernels up to 0.125Hz for …
Figure 6: CCC results for different number of clusters.
Figure 7: CCC results of the proposed network with different maximum …
Original abstract

Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a multi-delay sinc network: a CNN in which a stack of convolutional layers is followed by an aligner network built from delayed sinc layers. Each delayed sinc layer is a time-shifted low-pass sinc filter that learns a single delay through gradient descent. Stacking multiple such layers is claimed to compensate for non-stationary, acoustic-space-dependent delays between speech signals and continuous emotion labels (arousal, valence). The system is evaluated on the RECOLA and SEWA datasets and is asserted to achieve state-of-the-art speech-only performance by jointly learning alignment and prediction end to end.

Significance. If the empirical claims hold and the alignment mechanism is shown to function as described, the work could meaningfully advance continuous emotion recognition by providing an integrated, differentiable solution to label misalignment arising from human reaction times. The delayed sinc layer introduces a novel, gradient-based time-shift operator that may have utility beyond this domain in other misaligned signal-processing tasks.

major comments (2)
  1. [Abstract / Aligner network] Abstract and aligner network description: The manuscript states that 'Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.' However, each delayed sinc layer is defined to learn exactly one fixed delay via gradient descent on a time-shifted sinc filter. A convolutional stack therefore applies a fixed collection of constant delays uniformly across time and inputs, with no mechanism described for dynamic selection, per-timestep modulation, or content-dependent routing. This directly undermines the central claim that the architecture models non-stationary, acoustic-dependent misalignments.
  2. [Experimental evaluation] Experimental results: The abstract asserts state-of-the-art speech-only results on RECOLA and SEWA, yet supplies no numerical scores (e.g., CCC values), baseline comparisons, error bars, statistical tests, or protocol details (train/test splits, cross-validation). Without these, the performance claims cannot be verified and the efficacy of the joint alignment cannot be assessed.
minor comments (2)
  1. [Method] The mathematical formulation of the delayed sinc layer (filter impulse response, gradient computation for the delay parameter) should be provided explicitly with an equation to allow reproduction.
  2. [Results] Clarify whether the learned delays are visualized or analyzed post-training to confirm they capture reaction-time effects rather than arbitrary shifts.
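
For concreteness, the kind of formulation requested in minor comment 1 might look like the following. This is a textbook windowed-sinc delay written under assumed notation, not the paper's own equation: the cutoff $f_c$, the delay $\tau$, the window $w[n]$, and the loss $\mathcal{L}$ are hypothetical symbols introduced here.

```latex
% Impulse response of a low-pass sinc filter shifted by \tau samples
h_\tau[n] = 2 f_c \,\operatorname{sinc}\!\bigl(2 f_c (n - \tau)\bigr)\, w[n],
\qquad
\operatorname{sinc}(x) = \frac{\sin(\pi x)}{\pi x}

% Derivative of sinc, needed for the delay gradient
\operatorname{sinc}'(x) = \frac{\cos(\pi x) - \operatorname{sinc}(x)}{x}
\quad (x \neq 0), \qquad \operatorname{sinc}'(0) = 0

% Gradient of each tap with respect to the delay, and the chain rule to the loss
\frac{\partial h_\tau[n]}{\partial \tau}
  = -4 f_c^{2}\, \operatorname{sinc}'\!\bigl(2 f_c (n - \tau)\bigr)\, w[n],
\qquad
\frac{\partial \mathcal{L}}{\partial \tau}
  = \sum_{n} \frac{\partial \mathcal{L}}{\partial h_\tau[n]}\,
             \frac{\partial h_\tau[n]}{\partial \tau}
```

Under this form the filter taps are smooth in $\tau$, so a single scalar delay per layer can be updated by ordinary backpropagation.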

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our work. We respond to each major comment below and indicate the revisions we will make.

  1. Referee: [Abstract / Aligner network] Abstract and aligner network description: The manuscript states that 'Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.' However, each delayed sinc layer is defined to learn exactly one fixed delay via gradient descent on a time-shifted sinc filter. A convolutional stack therefore applies a fixed collection of constant delays uniformly across time and inputs, with no mechanism described for dynamic selection, per-timestep modulation, or content-dependent routing. This directly undermines the central claim that the architecture models non-stationary, acoustic-dependent misalignments.

    Authors: We appreciate the referee pointing out this distinction. Each delayed sinc layer learns one fixed delay, resulting in a fixed collection of delays across the stack. Our original phrasing intended to convey that different learned delays can address misalignments that vary with acoustic space or conditions (i.e., different constant delays under different recording environments). However, we agree there is no mechanism for per-timestep or content-dependent adjustment. We will revise the abstract and aligner network section to state more precisely that multiple layers learn a set of fixed delays to handle varying but constant misalignments across acoustic spaces, removing the implication of explicitly non-stationary (time-varying) modeling. These wording changes will be made in the revised manuscript. revision: yes

  2. Referee: [Experimental evaluation] Experimental results: The abstract asserts state-of-the-art speech-only results on RECOLA and SEWA, yet supplies no numerical scores (e.g., CCC values), baseline comparisons, error bars, statistical tests, or protocol details (train/test splits, cross-validation). Without these, the performance claims cannot be verified and the efficacy of the joint alignment cannot be assessed.

    Authors: We agree that the abstract would be strengthened by including concrete performance metrics. The full manuscript contains the detailed results, including CCC values for arousal and valence, baseline comparisons, and evaluation protocols on RECOLA and SEWA. In the revision, we will update the abstract to summarize key numerical results (e.g., specific CCC scores), note the speech-only SOTA comparisons, and briefly reference the cross-validation protocol. This will make the claims verifiable from the abstract alone while retaining the full details in the experimental section. revision: yes
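
The distinction at stake in the first response, a fixed set of delays versus a delay that changes over time, can be illustrated with a small sketch. It is purely hypothetical: it shows one way a bank of fixed delays could produce a time-varying effective delay through per-sample mixing weights. Nothing in the material above says the paper works this way, and where the weights would come from (for example, a network reading the acoustic features) is left open. The names `delay_bank` and `mix` are assumptions.

```python
import numpy as np

def delay_bank(x, delays):
    """Stack copies of x, each delayed by a fixed non-negative integer shift."""
    x = np.asarray(x, dtype=float)
    bank = np.zeros((len(delays), len(x)))
    for k, d in enumerate(delays):
        bank[k, d:] = x[:len(x) - d]   # zero-padded at the start
    return bank

def mix(bank, weights):
    """Per-sample convex combination of the delayed copies.

    bank    : array of shape (K, T), one row per fixed delay
    weights : array of shape (T, K), each row summing to 1
    """
    return np.einsum("kt,tk->t", bank, weights)

if __name__ == "__main__":
    x = np.arange(10.0)
    bank = delay_bank(x, delays=[0, 2])
    # Hypothetical weights: use delay 0 for the first half, delay 2 afterwards
    weights = np.zeros((10, 2))
    weights[:5, 0] = 1.0
    weights[5:, 1] = 1.0
    print(mix(bank, weights))
```

With weights held constant over time the mixture collapses to a fixed filter, which is the behavior the referee describes; only time-varying weights change the effective delay from sample to sample.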

Circularity Check

0 steps flagged

No significant circularity; empirical validation on external datasets


The paper introduces a multi-delay sinc network architecture for joint alignment and prediction of continuous emotion labels, with the central claims resting on empirical performance metrics obtained by training and evaluating on the public RECOLA and SEWA datasets. No load-bearing step reduces by construction to a fitted parameter renamed as a prediction, a self-citation chain, or a self-definitional relation; the delayed sinc layers are defined as trainable components whose learned delays are not presupposed in the reported results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the modeling assumption that time-varying delays can be captured by gradient-learned shifts inside sinc filters; this modeling choice is introduced by the paper without external validation.

free parameters (1)
  • learned per-layer delays
    The time shifts are parameters optimized by gradient descent on the emotion-prediction loss.
axioms (1)
  • domain assumption: A time shift between speech and continuous emotion labels can be modeled by a low-pass sinc filter whose delay is learned end-to-end
    This is the core premise that justifies the new delayed sinc layer.
invented entities (1)
  • delayed sinc layer (no independent evidence)
    purpose: To implement learnable time alignment inside a convolutional stack
    New component introduced by the paper; no independent evidence outside the reported experiments is mentioned.

pith-pipeline@v0.9.0 · 5745 in / 1352 out tokens · 28956 ms · 2026-05-25T01:55:17.247498+00:00 · methodology



Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 6 internal anchors

  1. [1]

    An argument for basic emotions,

    P . Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992

  2. [2]

    Wild wild emotion: a multimodal ensemble approach,

    J. Gideon, B. Zhang, Z. Aldeneh, Y. Kim, S. Khorram, D. Le, and E. M. Provost, “Wild wild emotion: a multimodal ensemble approach,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 501–505

  3. [3]

    Universals and cultural variations in 22 emotional ex- pressions across five cultures

    D. Cordaro, R. Sun, D. Keltner, S. Kamble, N. Huddar, and G. McNeil, “Universals and cultural variations in 22 emotional ex- pressions across five cultures.” Emotion (Washington, DC), vol. 18, no. 1, pp. 75–93, 2018

  4. [4]

    Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,

    B. W. Schuller, “Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018

  5. [5]

    Continuous estimation of emotions in speech by dynamic cooperative speaker models,

    A. Mencattini, E. Martinelli, F. Ringeval, B. Schuller, and C. Di Na- tale, “Continuous estimation of emotions in speech by dynamic cooperative speaker models,” IEEE transactions on affective comput- ing, vol. 8, no. 3, pp. 314–327, 2017

  6. [6]

    Av+ec 2015: the first affect recognition challenge bridging across audio, video, and physiological data,

    F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, “Av+ec 2015: the first affect recognition challenge bridging across audio, video, and physiological data,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 3–8

  7. [7]

    Pool- ing acoustic and lexical features for the prediction of valence,

    Z. Aldeneh, S. Khorram, D. Dimitriadis, and E. M. Provost, “Pool- ing acoustic and lexical features for the prediction of valence,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 68–72

  8. [8]

    End-to-end learning for dimensional emotion recognition from physiological signals,

    G. Keren, T. Kirschstein, E. Marchi, F. Ringeval, and B. Schuller, “End-to-end learning for dimensional emotion recognition from physiological signals,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 985–990

  9. [9]

    Study of dense network ap- proaches for speech emotion recognition,

    M. Abdelwahab and C. Busso, “Study of dense network ap- proaches for speech emotion recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, 2018

  10. [10]

    Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve pre- dictions of emotional attributes,

    S. Parthasarathy and C. Busso, “Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve pre- dictions of emotional attributes,” Proc. Interspeech 2018, pp. 3698– 3702, 2018

  11. [11]

    Exploiting acoustic and lexical properties of phonemes to recognize valence from speech,

    B. Zhang, S. Khorram, and E. M. Provost, “Exploiting acoustic and lexical properties of phonemes to recognize valence from speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5871–5875

  12. [12]

    End-to-end multimodal emotion recognition using deep neural networks,

    P . Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017

  13. [13]

    Summary for avec 2017: real-life depression and affect challenge and workshop,

    F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, and M. Pantic, “Summary for avec 2017: real-life depression and affect challenge and workshop,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 1963–1964

  14. [14]

    Learning representations of emotional speech with deep convolutional generative adversarial networks,

    J. Chang and S. Scherer, “Learning representations of emotional speech with deep convolutional generative adversarial networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2746–2750

  15. [15]

    Toward effective automatic recognition systems of emotion in speech,

    C. Busso, M. Bulut, S. Narayanan, J. Gratch, and S. Marsella, “Toward effective automatic recognition systems of emotion in speech,” Social emotions in nature and artifact: emotions in human and human-computer interaction, J. Gratch and S. Marsella, Eds , pp. 110–127, 2013

  16. [16]

    Emotion recognition in human- computer interaction,

    R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human- computer interaction,” IEEE Signal processing magazine , vol. 18, no. 1, pp. 32–80, 2001

  17. [17]

    Describing the emotional states that are expressed in speech,

    R. Cowie and R. R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech communication , vol. 40, no. 1-2, pp. 5–32, 2003

  18. [18]

    Jointly predicting arousal, valence and dominance with multi-task learning,

    S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” INTERSPEECH, Stock- holm, Sweden, 2017

  19. [19]

    Cross-corpus acoustic emotion recognition with multi-task learning: seeking common ground while preserving differences,

    B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition with multi-task learning: seeking common ground while preserving differences,” IEEE Transactions on Affec- tive Computing, 2017

  20. [20]

    Prediction-based learning for continuous emotion recognition in speech,

    J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning for continuous emotion recognition in speech,” in Acous- 14 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING tics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5005–5009

  21. [21]

    Continuous multimodal emotion prediction based on long short term memory recurrent neural network,

    J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge . ACM, 2017, pp. 11–18

  22. [22]

    An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction,

    Z. Huang, T. Dang, N. Cummins, B. Stasak, P . Le, V . Sethu, and J. Epps, “An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 41–48

  23. [23]

    Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,

    S. Mariooryad and C. Busso, “Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,” IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 97–108, 2015

  24. [24]

    Robust continuous prediction of human emotions using multi- scale dynamic cues,

    J. Nicolle, V . Rapp, K. Bailly, L. Prevost, and M. Chetouani, “Robust continuous prediction of human emotions using multi- scale dynamic cues,” in Proceedings of the 14th ACM international conference on Multimodal interaction. ACM, 2012, pp. 501–508

  25. [25]

    A history of experimental psychology,

    E. B ¨oring, “A history of experimental psychology,” New York: Appleton-Century, 1950

  26. [26]

    Errors of judgement at greenwich in 1796

    J. D. Mollon and A. J. Perkins, “Errors of judgement at greenwich in 1796.” Nature, 1996

  27. [27]

    On the speed of different senses and nerve transmis- sion by hirsch (1862),

    S. Nicolas, “On the speed of different senses and nerve transmis- sion by hirsch (1862),” Psychological Research , vol. 59, no. 4, pp. 261–268, 1997

  28. [28]

    Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations,

    S. Mariooryad and C. Busso, “Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations,” in Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on. IEEE, 2013, pp. 85–90

  29. [29]

    Automatic segmenta- tion of spontaneous data using dimensional labels from multiple coders,

    M. A. Nicolaou, H. Gunes, and M. Pantic, “Automatic segmenta- tion of spontaneous data using dimensional labels from multiple coders,” in Proc. of LREC Int. Workshop on Multimodal Corpora: Ad- vances in Capturing, Coding and Analyzing Multimodality . Citeseer, 2010, pp. 43–48

  30. [30]

    Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,

    G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nico- laou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200–5204

  31. [31]

    Continuous predic- tion of spontaneous affect from multiple cues and modalities in valence-arousal space,

    M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous predic- tion of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing , vol. 2, no. 2, pp. 92–105, 2011

  32. [32]

    Capturing long-term temporal dependencies with con- volutional networks for continuous emotion recognition,

    S. Khorram, Z. Aldeneh, D. Dimitriadis, M. McInnis, and E. M. Provost, “Capturing long-term temporal dependencies with con- volutional networks for continuous emotion recognition,” Proc. Interspeech 2017, pp. 1253–1257, 2017

  33. [33]

    Prediction of asynchronous dimen- sional emotion ratings from audiovisual and physiological data,

    F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P . Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller, “Prediction of asynchronous dimen- sional emotion ratings from audiovisual and physiological data,” Pattern Recognition Letters, vol. 66, pp. 22–30, 2015

  34. [34]

    Discretized continuous speech emotion recognition with multi-task deep recurrent neural network

    D. Le, Z. Aldeneh, and E. M. Provost, “Discretized continuous speech emotion recognition with multi-task deep recurrent neural network.” in INTERSPEECH, 2017, pp. 1108–1112

  35. [35]

    Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,

    F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in International Conference and Workshops on Automatic Face and Gesture Recognition (FG) . IEEE, 2013, pp. 1–8

  36. [36]

    Avec 2017: real-life depression, and affect recognition workshop and chal- lenge,

    F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “Avec 2017: real-life depression, and affect recognition workshop and chal- lenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 3–9

  37. [37]

    Multimodal multi-task learning for dimensional and continuous emotion recognition,

    S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 19–26

  38. [38]

    Long short-term memory recurrent neural network architectures for large scale acoustic modeling,

    H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014

  39. [39]

    Multi-Scale Context Aggregation by Dilated Convolutions

    F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015

  40. [40]

    Learning deconvolution network for semantic segmentation,

    H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 1520–1528

  41. [41]

    Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,

    V . Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,” IEEE transactions on pattern analysis and machine intelligence , vol. 39, no. 12, pp. 2481–2495, 2017

  42. [42]

    Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,

    F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud et al., “Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,” in Proceedings of the 2018 on Audio/Visual Emo- tion Challenge and Workshop. ACM, 2018, pp. 3–13

  43. [43]

    Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,

    C. Wang, P . Lopes, T. Pun, and G. Chanel, “Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 73–81

  44. [44]

    Speech-based continuous emotion prediction by learning perception responses related to salient events: a study based on vocal affect bursts and cross-cultural affect in avec 2018,

    K. Wataraka Gamage, T. Dang, V . Sethu, J. Epps, and E. Ambikaira- jah, “Speech-based continuous emotion prediction by learning perception responses related to salient events: a study based on vocal affect bursts and cross-cultural affect in avec 2018,” in Pro- ceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 47–55

  45. [45]

    Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,

    J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 65–72

  46. [46]

    Multimodal affective dimension prediction using deep bidirectional long short- term memory recurrent neural networks,

    L. He, D. Jiang, L. Yang, E. Pei, P . Wu, and H. Sahli, “Multimodal affective dimension prediction using deep bidirectional long short- term memory recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge . ACM, 2015, pp. 73–80

  47. [47]

    Recent develop- ments in opensmile, the munich open-source multimedia feature extractor,

    F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent develop- ments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 835–838

  48. [48]

    Yaafe, an easy to use and efficient audio feature extraction software

    B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, “Yaafe, an easy to use and efficient audio feature extraction software.” in ISMIR, 2010, pp. 441–446

  49. [49]

    Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,

    K. Brady, Y. Gwon, P . Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . ACM, 2016, pp. 97–104

  50. [50]

    Multimodal emotion recognition for AVEC 2016 challenge,

    F. Povolny, P . Matejka, M. Hradis, A. Popkov ´a, L. Otrusina, P . Smrz, I. Wood, C. Robin, and L. Lamel, “Multimodal emotion recognition for AVEC 2016 challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . ACM, 2016, pp. 75–82

  51. [51]

    The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,

    F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y. Devillers, J. Epps, P . Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

  52. [52]

    Deconvolutional networks,

    M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2528–2535

  53. [53]

    A guide to convolution arithmetic for deep learning

    V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016

  54. [54]

    Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge,

    B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge,” Speech Communication, vol. 53, no. 9-10, pp. 1062–1087, 2011

  55. [55]

    Soundnet: learning sound representations from unlabeled video,

    Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems, 2016, pp. 892–900

  56. [56]

    AVEC 2016: depression, mood, and emotion recognition workshop and challenge,

    M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 3–10

  57. [57]

    Openear-introducing the munich open-source emotion and affect recognition toolkit,

    F. Eyben, M. Wöllmer, and B. Schuller, “Openear-introducing the munich open-source emotion and affect recognition toolkit,” in Affective computing and intelligent interaction and workshops, 2009. ACII 2009. 3rd international conference on. IEEE, 2009, pp. 1–6

  58. [58]

    Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques

    L. Muda, M. Begam, and I. Elamvazuthi, “Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques,” arXiv preprint arXiv:1003.4083, 2010.

  59. [59]

    Recognition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge

    S. Khorram, J. Gideon, M. G. McInnis, and E. M. Provost, “Recognition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge.” in INTERSPEECH, 2016, pp. 1215–1219

  60. [60]

    Mel-generalized cepstral analysis-a unified approach to speech spectral estimation,

    K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis-a unified approach to speech spectral estimation,” in Third International Conference on Spoken Language Processing, 1994

  61. [61]

    Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis,

    S. Khorram, H. Sameti, F. Bahmaninezhad, S. King, and T. Drugman, “Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, no. 1, p. 12, 2014

  62. [62]

    Soft context clustering for f0 modeling in hmm-based speech synthesis,

    S. Khorram, H. Sameti, and S. King, “Soft context clustering for f0 modeling in hmm-based speech synthesis,” EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 1, p. 2, 2015

  63. [63]

    Speech synthesis based on gaussian conditional random fields,

    S. Khorram, F. Bahmaninezhad, and H. Sameti, “Speech synthesis based on gaussian conditional random fields,” in International Symposium on Artificial Intelligence and Signal Processing. Springer, 2013, pp. 183–193

  64. [64]

    Using neutral speech models for emotional speech analysis

    C. Busso, S. Lee, and S. S. Narayanan, “Using neutral speech models for emotional speech analysis.” in Interspeech, 2007, pp. 2225–2228

  65. [65]

    The priori emotion dataset: Linking mood to emotion detected in-the-wild,

    S. Khorram, M. Jaiswal, J. Gideon, M. McInnis, and E.-M. Provost, “The priori emotion dataset: Linking mood to emotion detected in-the-wild,” Proc. Interspeech 2018, pp. 1903–1907, 2018

  66. [66]

    The Kaldi speech recognition toolkit,

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2011

  67. [67]

    Emotion recognition from spontaneous speech using hidden markov models with deep belief networks,

    D. Le and E. M. Provost, “Emotion recognition from spontaneous speech using hidden markov models with deep belief networks,” in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2013, pp. 216–221

  68. [68]

    Adam: A Method for Stochastic Optimization

    D. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  69. [69]

    S. K. Mitra and Y. Kuo, Digital signal processing: a computer-based approach. McGraw-Hill Higher Education, 2006, vol. 2

  70. [70]

    Trainable Time Warping: Aligning Time-Series in the Continuous-Time Domain

    S. Khorram, M. G. McInnis, and E. M. Provost, “Trainable time warping: Aligning time-series in the continuous-time domain,” arXiv preprint arXiv:1903.09245, 2019

  71. [71]

    Automatic detection of laughter,

    K. Truong and D. van Leeuwen, “Automatic detection of laughter,” in 9th European Conference on Speech Communication and Technology, 4 September 2005 through 8 September 2005, Lisbon, pp. 485–488, 2005

  72. [72]

    Acoustic analysis of laughter,

    C. A. Bickley and S. Hunnicutt, “Acoustic analysis of laughter,” in Second International Conference on Spoken Language Processing, 1992

  73. [73]

    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

    M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016

  74. [74]

    Progressive neural networks for transfer learning in emotion recognition,

    J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, “Progressive neural networks for transfer learning in emotion recognition,” Proc. Interspeech 2017, pp. 1098–1102, 2017

  75. [75]

    Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in avec 2017,

    T. Dang, B. Stasak, Z. Huang, S. Jayawardena, M. Atcheson, M. Hayat, P. Le, V. Sethu, R. Goecke, and J. Epps, “Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in avec 2017,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 27–35

  76. [76]

    Tracking changes in continuous emotion states using body language and prosodic cues,

    A. Metallinou, A. Katsamanis, Y. Wang, and S. Narayanan, “Tracking changes in continuous emotion states using body language and prosodic cues,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2288–2291

  77. [77]

    Age and sex differences in reaction time in adulthood: results from the united kingdom health and lifestyle survey

    G. Der and I. J. Deary, “Age and sex differences in reaction time in adulthood: results from the united kingdom health and lifestyle survey.” Psychology and aging, vol. 21, no. 1, p. 62, 2006

  78. [78]

    Concentration analysis: a quantitative assessment of student states,

    L. Bao and E. F. Redish, “Concentration analysis: a quantitative assessment of student states,” American Journal of Physics, vol. 69, no. S1, pp. S45–S53, 2001

  79. [79]

    Video-based emotion recognition in the wild using deep transfer learning and score fusion,

    H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition in the wild using deep transfer learning and score fusion,” Image and Vision Computing, vol. 65, pp. 66–75, 2017

  80. [80]

    Audio-visual emotion recognition in video clips,

    F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, and G. Anbarjafari, “Audio-visual emotion recognition in video clips,” IEEE Transactions on Affective Computing, 2017
