Jointly Aligning and Predicting Continuous Emotion Annotations

Emily Mower Provost; Melvin G McInnis; Soheil Khorram

arXiv: 1907.03050 · v2 · pith:ZVHR2SRLnew · submitted 2019-07-05 · cs.LG · cs.HC · eess.AS · stat.ML

Pith reviewed 2026-05-25 01:55 UTC · model grok-4.3

classification cs.LG · cs.HC · eess.AS · stat.ML

keywords continuous emotion recognition · speech emotion analysis · label alignment · convolutional neural network · sinc filter · RECOLA dataset · SEWA dataset · dimensional emotion prediction

The pith

A convolutional network learns time shifts via delayed sinc layers to jointly align speech signals with delayed continuous emotion labels and predict arousal and valence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the fact that continuous emotion annotations from human raters are delayed relative to the speech input because of reaction times. It introduces a multi-delay sinc network consisting of convolutional layers followed by an aligner that uses multiple delayed sinc layers to learn and compensate for these time-varying, non-stationary delays in an end-to-end fashion. The network is evaluated on the RECOLA and SEWA datasets and is shown to reach state-of-the-art speech-only performance on dimensional emotion descriptors. A sympathetic reader would care because accurate continuous emotion tracking from speech could support applications in mental health monitoring or human-computer interaction once the annotation delay problem is solved automatically rather than through manual synchronization.

Core claim

The multi-delay sinc network simultaneously aligns the speech signal and emotion labels by implementing an aligner network as a stack of delayed sinc layers, each a time-shifted low-pass sinc filter that learns a single delay through gradient descent; multiple such layers together handle non-stationary delays that depend on the acoustic space, while the overall model predicts dimensional descriptors of emotions.

What carries the argument

The delayed sinc layer: a convolutional layer that implements a time-shifted low-pass sinc filter and uses gradient-based optimization to learn one delay value per layer; stacking them compensates for variable reaction-time delays between speech and labels.
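
To make the mechanism concrete, here is a minimal NumPy sketch of a time-shifted, windowed low-pass sinc kernel, following the formula quoted from the paper in the Lean-theorem section of this page. It is an illustration under stated assumptions, not the authors' implementation: the Hamming window stands in for the unspecified window h_w, and the names `delayed_sinc_kernel`, `apply_delay`, and `half_width` are hypothetical. In the paper the delay is a trainable parameter; here it is a plain argument.

```python
import numpy as np

def delayed_sinc_kernel(tau, fc, fs, half_width):
    """Windowed low-pass sinc kernel shifted by tau seconds.

    tau: delay in seconds, fc: cutoff in Hz, fs: sampling rate in Hz,
    half_width: number of taps on each side of n = 0 (hypothetical parameter).
    """
    n = np.arange(-half_width, half_width + 1)
    # np.sinc is the normalized sinc, sin(pi x) / (pi x).
    ideal = (2.0 * fc / fs) * np.sinc(2.0 * fc * (n / fs - tau))
    # Assumption: the source does not name the window h_w; Hamming is a stand-in.
    window = np.hamming(n.size)
    return ideal * window

def apply_delay(x, tau, fc, fs, half_width):
    """Filter x with the delayed kernel, keeping the input length.

    Assumes len(x) is at least the kernel length (2 * half_width + 1).
    """
    h = delayed_sinc_kernel(tau, fc, fs, half_width)
    return np.convolve(x, h, mode="same")
```

With this convention a positive `tau` moves content later in time: at `fs = 10` Hz and `tau = 1.0` s, the kernel peak sits 10 samples after its center. Setting `tau = 0` gives an ordinary centered low-pass filter, which corresponds to the "all delays fixed to zero" ablation mentioned further down.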

If this is right

The model reaches state-of-the-art speech-only results on both RECOLA and SEWA by learning time-varying delays.
End-to-end training becomes possible without a separate alignment preprocessing step.
Non-stationary delays that vary with acoustic content can be compensated inside the network.
Dimensional emotion descriptors can be predicted directly from aligned speech features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same delayed-sinc alignment mechanism could be tested on other time-series annotation tasks that suffer from human reaction delays, such as continuous affect in video or physiological signals.
If the learned delays prove stable across speakers or sessions, the approach might reduce the amount of manual synchronization required when collecting new emotion datasets.
Replacing the sinc-based aligner with other differentiable delay mechanisms could be compared directly on the same datasets to isolate the contribution of the low-pass filter property.

Load-bearing premise

The misalignment between the speech signal and the emotion labels can be adequately compensated by a stack of learned time shifts implemented via delayed sinc layers.

What would settle it

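An ablation on RECOLA or SEWA would settle it. If a version of the network with the aligner removed, or with all delays fixed to zero, matches or exceeds the full model when predicting arousal and valence, the learned alignment is not carrying the result; if it falls clearly short of the full model, the premise holds.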

Figures

Figures reproduced from arXiv: 1907.03050 by Emily Mower Provost, Melvin G McInnis, Soheil Khorram.

**Figure 2.** A visualization of our multi-delay sinc network with …

**Figure 3.** Performance reduction caused by applying a windowed-sinc filter …

**Figure 4.** Distribution of the delays trained through different runs of the MDS network with one cluster. This network learns one delay to compensate …

**Figure 5.** Increasing bandwidth of the sinc kernels up to 0.125 Hz for …

**Figure 6.** CCC results for different number of clusters.

**Figure 7.** CCC results of the proposed network with different maximum …

Original abstract

Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The delayed sinc layer is a straightforward new primitive for learning fixed alignment shifts inside a conv net, but the stack does not obviously produce acoustic-content-dependent or time-varying delays.

The paper introduces a delayed sinc layer inside a convolutional stack that learns a single scalar delay per layer via gradient descent on a time-shifted sinc filter. The aligner network then uses these layers to shift the speech features so they better match the delayed emotion labels, all trained end-to-end with the prediction head. This is the concrete novelty: treating alignment as a differentiable operation rather than a separate preprocessing step.

On the positive side, the approach directly targets a known practical headache in continuous emotion work, and evaluating on RECOLA and SEWA keeps the experiments grounded in standard benchmarks for the area. The end-to-end framing is clean and could reduce pipeline complexity for people already using conv nets on speech features.

The main limitation is the handling of non-stationary delays. The abstract states that multiple layers compensate for delays that are a function of the acoustic space, yet each layer learns only one fixed delay. A stack therefore supplies a fixed collection of shifts applied uniformly, with no mechanism described for content-dependent routing or per-timestep modulation. Without that, the model cannot represent the claimed dependence on local acoustics. The state-of-the-art claim is asserted without numbers, baselines, or protocol details visible in the abstract, so the strength of the empirical case remains to be checked in the full text.

The thinking is direct and the problem is well scoped. This is worth a reading group slot for anyone working on alignment or continuous annotation tasks in speech or affective computing. I would cite the delayed sinc construction if I needed a differentiable shift operator in similar work. It deserves peer review because the technique is specific enough to evaluate and the datasets are public, even if the non-stationary justification needs tightening.
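
The limitation raised in the letter rests on a standard signal-processing fact: any fixed combination of fixed convolution kernels is itself one fixed kernel, so it acts identically at every time step. The toy sketch below illustrates this with pure integer shifts standing in for delayed sinc kernels. The helper `shift_kernel` and the mixing weights 0.4 and 0.6 are hypothetical, and nothing here represents the paper's actual aligner, which may contain mechanisms not visible from the abstract.

```python
import numpy as np

def shift_kernel(delay_samples, half_width):
    """Toy stand-in for a delayed sinc kernel: a pure integer-sample shift."""
    h = np.zeros(2 * half_width + 1)
    h[half_width + delay_samples] = 1.0
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
h1 = shift_kernel(3, 10)   # delay of 3 samples
h2 = shift_kernel(7, 10)   # delay of 7 samples

# Parallel branches mixed with fixed weights collapse into one fixed kernel.
parallel = 0.4 * np.convolve(x, h1) + 0.6 * np.convolve(x, h2)
single_parallel = np.convolve(x, 0.4 * h1 + 0.6 * h2)

# Layers applied in series also collapse into one kernel; their delays add.
serial = np.convolve(np.convolve(x, h1), h2)
single_serial = np.convolve(x, np.convolve(h1, h2))
```

In both arrangements the result equals a single time-invariant filter. A delay that varies over time would need weights that depend on the input at each time step, which is the mechanism the letter says is not described.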

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a multi-delay sinc network, a CNN architecture with convolutional layers followed by an aligner network implemented via delayed sinc layers. Each delayed sinc layer is a time-shifted low-pass sinc filter that learns a single delay through gradient descent. Stacking multiple such layers is claimed to compensate for non-stationary, acoustic-space-dependent delays between speech signals and continuous emotion labels (arousal, valence). The system is evaluated on RECOLA and SEWA datasets and asserted to achieve state-of-the-art speech-only performance by jointly learning alignment and prediction in an end-to-end manner.

Significance. If the empirical claims hold and the alignment mechanism is shown to function as described, the work could meaningfully advance continuous emotion recognition by providing an integrated, differentiable solution to label misalignment arising from human reaction times. The delayed sinc layer introduces a novel, gradient-based time-shift operator that may have utility beyond this domain in other misaligned signal-processing tasks.

major comments (2)

[Abstract / Aligner network] Abstract and aligner network description: The manuscript states that 'Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.' However, each delayed sinc layer is defined to learn exactly one fixed delay via gradient descent on a time-shifted sinc filter. A convolutional stack therefore applies a fixed collection of constant delays uniformly across time and inputs, with no mechanism described for dynamic selection, per-timestep modulation, or content-dependent routing. This directly undermines the central claim that the architecture models non-stationary, acoustic-dependent misalignments.
[Experimental evaluation] Experimental results: The abstract asserts state-of-the-art speech-only results on RECOLA and SEWA, yet supplies no numerical scores (e.g., CCC values), baseline comparisons, error bars, statistical tests, or protocol details (train/test splits, cross-validation). Without these, the performance claims cannot be verified and the efficacy of the joint alignment cannot be assessed.
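
The CCC values requested above refer to the concordance correlation coefficient. As a reference point, here is a minimal NumPy sketch of its standard definition. The function name `concordance_cc` is hypothetical, and the sketch makes no claim about the paper's evaluation protocol (for example, whether scores are computed per recording or over concatenated recordings), which is not stated on this page.

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2),
    using population (biased) variance and covariance.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)
```

Unlike Pearson correlation, CCC penalizes differences in mean and scale, so a prediction that tracks the label perfectly but with a constant offset scores below 1.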

minor comments (2)

[Method] The mathematical formulation of the delayed sinc layer (filter impulse response, gradient computation for the delay parameter) should be provided explicitly with an equation to allow reproduction.
[Results] Clarify whether the learned delays are visualized or analyzed post-training to confirm they capture reaction-time effects rather than arbitrary shifts.
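
The first minor comment can be made concrete. Starting from the windowed-sinc kernel quoted from the paper elsewhere on this page, the derivative with respect to the delay follows from the chain rule. This is a sketch under two assumptions that the page does not confirm: sinc is the normalized sinc, and the window h_w[n] does not depend on τ. The loss symbol \mathcal{L} is introduced here for illustration.

```latex
% Assumptions: normalized sinc, and a window h_w[n] that does not depend on \tau.
\[
h_{\mathrm{sinc}}[n;\tau] = \frac{2 f_c}{f_s}\,\operatorname{sinc}(x_n)\,h_w[n],
\qquad x_n = 2 f_c\left(\frac{n}{f_s} - \tau\right),
\qquad \operatorname{sinc}(x) = \frac{\sin(\pi x)}{\pi x}
\]
\[
\frac{\partial h_{\mathrm{sinc}}[n;\tau]}{\partial \tau}
= -\frac{4 f_c^{2}}{f_s}\,\operatorname{sinc}'(x_n)\,h_w[n],
\qquad
\operatorname{sinc}'(x) =
\begin{cases}
\dfrac{\cos(\pi x) - \operatorname{sinc}(x)}{x}, & x \neq 0 \\[4pt]
0, & x = 0
\end{cases}
\]
\[
\frac{\partial \mathcal{L}}{\partial \tau}
= \sum_{n} \frac{\partial \mathcal{L}}{\partial h_{\mathrm{sinc}}[n;\tau]}\,
\frac{\partial h_{\mathrm{sinc}}[n;\tau]}{\partial \tau}
\]
```

The last line is the quantity a gradient-based optimizer would use to update the single delay of one layer; automatic differentiation frameworks compute it without the closed form being written out.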

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our work. We respond to each major comment below and indicate the revisions we will make.

Referee: [Abstract / Aligner network] Abstract and aligner network description: The manuscript states that 'Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.' However, each delayed sinc layer is defined to learn exactly one fixed delay via gradient descent on a time-shifted sinc filter. A convolutional stack therefore applies a fixed collection of constant delays uniformly across time and inputs, with no mechanism described for dynamic selection, per-timestep modulation, or content-dependent routing. This directly undermines the central claim that the architecture models non-stationary, acoustic-dependent misalignments.

Authors: We appreciate the referee pointing out this distinction. Each delayed sinc layer learns one fixed delay, resulting in a fixed collection of delays across the stack. Our original phrasing intended to convey that different learned delays can address misalignments that vary with acoustic space or conditions (i.e., different constant delays under different recording environments). However, we agree there is no mechanism for per-timestep or content-dependent adjustment. We will revise the abstract and aligner network section to state more precisely that multiple layers learn a set of fixed delays to handle varying but constant misalignments across acoustic spaces, removing the implication of explicitly non-stationary (time-varying) modeling. These wording changes will be made in the revised manuscript.
revision: yes

Referee: [Experimental evaluation] Experimental results: The abstract asserts state-of-the-art speech-only results on RECOLA and SEWA, yet supplies no numerical scores (e.g., CCC values), baseline comparisons, error bars, statistical tests, or protocol details (train/test splits, cross-validation). Without these, the performance claims cannot be verified and the efficacy of the joint alignment cannot be assessed.

Authors: We agree that the abstract would be strengthened by including concrete performance metrics. The full manuscript contains the detailed results, including CCC values for arousal and valence, baseline comparisons, and evaluation protocols on RECOLA and SEWA. In the revision, we will update the abstract to summarize key numerical results (e.g., specific CCC scores), note the speech-only SOTA comparisons, and briefly reference the cross-validation protocol. This will make the claims verifiable from the abstract alone while retaining the full details in the experimental section.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical validation on external datasets

The paper introduces a multi-delay sinc network architecture for joint alignment and prediction of continuous emotion labels, with the central claims resting on empirical performance metrics obtained by training and evaluating on the public RECOLA and SEWA datasets. No load-bearing step reduces by construction to a fitted parameter renamed as a prediction, a self-citation chain, or a self-definitional relation; the delayed sinc layers are defined as trainable components whose learned delays are not presupposed in the reported results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the modeling assumption that time-varying delays can be captured by gradient-learned shifts inside sinc filters; this modeling choice is introduced by the paper without external validation.

free parameters (1)

learned per-layer delays
The time shifts are parameters optimized by gradient descent on the emotion-prediction loss.

axioms (1)

domain assumption: A time shift between speech and continuous emotion labels can be modeled by a low-pass sinc filter whose delay is learned end-to-end.
This is the core premise that justifies the new delayed sinc layer.

invented entities (1)

delayed sinc layer (no independent evidence)
purpose: To implement learnable time alignment inside a convolutional stack
New component introduced by the paper; no independent evidence outside the reported experiments is mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

The delayed sinc layer … is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

… we introduce the delayed sinc layer … $h_{\mathrm{sinc}}[n;\tau] = \frac{2 f_c}{f_s}\,\mathrm{sinc}\!\left[2 f_c\left(\frac{n}{f_s}-\tau\right)\right] h_w[n]$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 6 internal anchors

[1]

An argument for basic emotions,

P . Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992

work page 1992
[2]

Wild wild emotion: a multimodal ensemble approach,

J. Gideon, B. Zhang, Z. Aldeneh, Y. Kim, S. Khorram, D. Le, and E. M. Provost, “Wild wild emotion: a multimodal ensemble approach,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 501–505

work page 2016
[3]

Universals and cultural variations in 22 emotional ex- pressions across ﬁve cultures

D. Cordaro, R. Sun, D. Keltner, S. Kamble, N. Huddar, and G. McNeil, “Universals and cultural variations in 22 emotional ex- pressions across ﬁve cultures.” Emotion (Washington, DC), vol. 18, no. 1, pp. 75–93, 2018

work page 2018
[4]

Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,

B. W. Schuller, “Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018

work page 2018
[5]

Continuous estimation of emotions in speech by dynamic cooperative speaker models,

A. Mencattini, E. Martinelli, F. Ringeval, B. Schuller, and C. Di Na- tale, “Continuous estimation of emotions in speech by dynamic cooperative speaker models,” IEEE transactions on affective comput- ing, vol. 8, no. 3, pp. 314–327, 2017

work page 2017
[6]

Av+ec 2015: the ﬁrst affect recognition challenge bridging across audio, video, and physiological data,

F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, “Av+ec 2015: the ﬁrst affect recognition challenge bridging across audio, video, and physiological data,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 3–8

work page 2015
[7]

Pool- ing acoustic and lexical features for the prediction of valence,

Z. Aldeneh, S. Khorram, D. Dimitriadis, and E. M. Provost, “Pool- ing acoustic and lexical features for the prediction of valence,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 68–72

work page 2017
[8]

End-to-end learning for dimensional emotion recognition from physiological signals,

G. Keren, T. Kirschstein, E. Marchi, F. Ringeval, and B. Schuller, “End-to-end learning for dimensional emotion recognition from physiological signals,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 985–990

work page 2017
[9]

Study of dense network ap- proaches for speech emotion recognition,

M. Abdelwahab and C. Busso, “Study of dense network ap- proaches for speech emotion recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, 2018

work page 2018
[10]

Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve pre- dictions of emotional attributes,

S. Parthasarathy and C. Busso, “Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve pre- dictions of emotional attributes,” Proc. Interspeech 2018, pp. 3698– 3702, 2018

work page 2018
[11]

Exploiting acoustic and lexical properties of phonemes to recognize valence from speech,

B. Zhang, S. Khorram, and E. M. Provost, “Exploiting acoustic and lexical properties of phonemes to recognize valence from speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5871–5875

work page 2019
[12]

End-to-end multimodal emotion recognition using deep neural networks,

P . Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017

work page 2017
[13]

Summary for avec 2017: real-life depression and affect challenge and workshop,

F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, and M. Pantic, “Summary for avec 2017: real-life depression and affect challenge and workshop,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 1963–1964

work page 2017
[14]

Learning representations of emotional speech with deep convolutional generative adversarial networks,

J. Chang and S. Scherer, “Learning representations of emotional speech with deep convolutional generative adversarial networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2746–2750

work page 2017
[15]

Toward effective automatic recognition systems of emotion in speech,

C. Busso, M. Bulut, S. Narayanan, J. Gratch, and S. Marsella, “Toward effective automatic recognition systems of emotion in speech,” Social emotions in nature and artifact: emotions in human and human-computer interaction, J. Gratch and S. Marsella, Eds , pp. 110–127, 2013

work page 2013
[16]

Emotion recognition in human- computer interaction,

R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human- computer interaction,” IEEE Signal processing magazine , vol. 18, no. 1, pp. 32–80, 2001

work page 2001
[17]

Describing the emotional states that are expressed in speech,

R. Cowie and R. R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech communication , vol. 40, no. 1-2, pp. 5–32, 2003

work page 2003
[18]

Jointly predicting arousal, valence and dominance with multi-task learning,

S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” INTERSPEECH, Stock- holm, Sweden, 2017

work page 2017
[19]

Cross-corpus acoustic emotion recognition with multi-task learning: seeking common ground while preserving differences,

B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition with multi-task learning: seeking common ground while preserving differences,” IEEE Transactions on Affec- tive Computing, 2017

work page 2017
[20]

Prediction-based learning for continuous emotion recognition in speech,

J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning for continuous emotion recognition in speech,” in Acous- 14 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING tics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5005–5009

work page 2017
[21]

Continuous multimodal emotion prediction based on long short term memory recurrent neural network,

J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge . ACM, 2017, pp. 11–18

work page 2017
[22]

An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction,

Z. Huang, T. Dang, N. Cummins, B. Stasak, P . Le, V . Sethu, and J. Epps, “An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 41–48

work page 2015
[23]

Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,

S. Mariooryad and C. Busso, “Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,” IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 97–108, 2015

work page 2015
[24]

Robust continuous prediction of human emotions using multi- scale dynamic cues,

J. Nicolle, V . Rapp, K. Bailly, L. Prevost, and M. Chetouani, “Robust continuous prediction of human emotions using multi- scale dynamic cues,” in Proceedings of the 14th ACM international conference on Multimodal interaction. ACM, 2012, pp. 501–508

work page 2012
[25]

A history of experimental psychology,

E. B ¨oring, “A history of experimental psychology,” New York: Appleton-Century, 1950

work page 1950
[26]

Errors of judgement at greenwich in 1796

J. D. Mollon and A. J. Perkins, “Errors of judgement at greenwich in 1796.” Nature, 1996

work page 1996
[27]

On the speed of different senses and nerve transmis- sion by hirsch (1862),

S. Nicolas, “On the speed of different senses and nerve transmis- sion by hirsch (1862),” Psychological Research , vol. 59, no. 4, pp. 261–268, 1997

work page 1997
[28]

Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations,

S. Mariooryad and C. Busso, “Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations,” in Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on. IEEE, 2013, pp. 85–90

work page 2013
[29]

Automatic segmenta- tion of spontaneous data using dimensional labels from multiple coders,

M. A. Nicolaou, H. Gunes, and M. Pantic, “Automatic segmenta- tion of spontaneous data using dimensional labels from multiple coders,” in Proc. of LREC Int. Workshop on Multimodal Corpora: Ad- vances in Capturing, Coding and Analyzing Multimodality . Citeseer, 2010, pp. 43–48

work page 2010
[30]

Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,

G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nico- laou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200–5204

work page 2016
[31]

Continuous predic- tion of spontaneous affect from multiple cues and modalities in valence-arousal space,

M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous predic- tion of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing , vol. 2, no. 2, pp. 92–105, 2011

work page 2011
[32]

Capturing long-term temporal dependencies with con- volutional networks for continuous emotion recognition,

S. Khorram, Z. Aldeneh, D. Dimitriadis, M. McInnis, and E. M. Provost, “Capturing long-term temporal dependencies with con- volutional networks for continuous emotion recognition,” Proc. Interspeech 2017, pp. 1253–1257, 2017

work page 2017
[33]

Prediction of asynchronous dimen- sional emotion ratings from audiovisual and physiological data,

F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P . Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller, “Prediction of asynchronous dimen- sional emotion ratings from audiovisual and physiological data,” Pattern Recognition Letters, vol. 66, pp. 22–30, 2015

work page 2015
[34]

Discretized continuous speech emotion recognition with multi-task deep recurrent neural network

D. Le, Z. Aldeneh, and E. M. Provost, “Discretized continuous speech emotion recognition with multi-task deep recurrent neural network.” in INTERSPEECH, 2017, pp. 1108–1112

work page 2017
[35]

Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,

F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in International Conference and Workshops on Automatic Face and Gesture Recognition (FG) . IEEE, 2013, pp. 1–8

work page 2013
[36]

Avec 2017: real-life depression, and affect recognition workshop and chal- lenge,

F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “Avec 2017: real-life depression, and affect recognition workshop and chal- lenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 3–9

work page 2017
[37]

Multimodal multi-task learning for dimensional and continuous emotion recognition,

S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 19–26

work page 2017
[38]

Long short-term memory recurrent neural network architectures for large scale acoustic modeling,

H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014

work page 2014
[39]

Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[40]

Learning deconvolution network for semantic segmentation,

H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 1520–1528

work page 2015
[41]

Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,

V . Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,” IEEE transactions on pattern analysis and machine intelligence , vol. 39, no. 12, pp. 2481–2495, 2017

work page 2017
[42]

Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,

F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud et al., “Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,” in Proceedings of the 2018 on Audio/Visual Emo- tion Challenge and Workshop. ACM, 2018, pp. 3–13

work page 2018
[43]

Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,

C. Wang, P . Lopes, T. Pun, and G. Chanel, “Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 73–81

work page 2018
[44] K. Wataraka Gamage, T. Dang, V. Sethu, J. Epps, and E. Ambikairajah, “Speech-based continuous emotion prediction by learning perception responses related to salient events: a study based on vocal affect bursts and cross-cultural affect in avec 2018,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 47–55.

[45] J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural dimensional continues emotion recognition in dyadic interactions,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 65–72.

[46] L. He, D. Jiang, L. Yang, E. Pei, P. Wu, and H. Sahli, “Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 73–80.

[47] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 835–838.

[48] B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, “Yaafe, an easy to use and efficient audio feature extraction software,” in ISMIR, 2010, pp. 441–446.

[49] K. Brady, Y. Gwon, P. Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 97–104.

[50] F. Povolny, P. Matejka, M. Hradis, A. Popková, L. Otrusina, P. Smrz, I. Wood, C. Robin, and L. Lamel, “Multimodal emotion recognition for AVEC 2016 challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 75–82.

[51] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.

[52] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2528–2535.

[53] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.

[54] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge,” Speech Communication, vol. 53, no. 9-10, pp. 1062–1087, 2011.

[55] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems, 2016, pp. 892–900.

[56] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 2016, pp. 3–10.

[57] F. Eyben, M. Wöllmer, and B. Schuller, “Openear-introducing the munich open-source emotion and affect recognition toolkit,” in Affective computing and intelligent interaction and workshops, 2009. ACII 2009. 3rd international conference on. IEEE, 2009, pp. 1–6.

[58] L. Muda, M. Begam, and I. Elamvazuthi, “Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques,” arXiv preprint arXiv:1003.4083, 2010.

[59] S. Khorram, J. Gideon, M. G. McInnis, and E. M. Provost, “Recognition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge,” in INTERSPEECH, 2016, pp. 1215–1219.

[60] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis-a unified approach to speech spectral estimation,” in Third International Conference on Spoken Language Processing, 1994.

[61] S. Khorram, H. Sameti, F. Bahmaninezhad, S. King, and T. Drugman, “Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, no. 1, p. 12, 2014.

[62] S. Khorram, H. Sameti, and S. King, “Soft context clustering for f0 modeling in hmm-based speech synthesis,” EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 1, p. 2, 2015.

[63] S. Khorram, F. Bahmaninezhad, and H. Sameti, “Speech synthesis based on gaussian conditional random fields,” in International Symposium on Artificial Intelligence and Signal Processing. Springer, 2013, pp. 183–193.

[64] C. Busso, S. Lee, and S. S. Narayanan, “Using neutral speech models for emotional speech analysis,” in Interspeech, 2007, pp. 2225–2228.

[65] S. Khorram, M. Jaiswal, J. Gideon, M. McInnis, and E.-M. Provost, “The priori emotion dataset: Linking mood to emotion detected in-the-wild,” Proc. Interspeech 2018, pp. 1903–1907, 2018.

[66] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2011.

[67] D. Le and E. M. Provost, “Emotion recognition from spontaneous speech using hidden markov models with deep belief networks,” in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2013, pp. 216–221.

[68] D. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[69] S. K. Mitra and Y. Kuo, Digital signal processing: a computer-based approach. McGraw-Hill Higher Education, 2006, vol. 2.

[70] S. Khorram, M. G. McInnis, and E. M. Provost, “Trainable time warping: Aligning time-series in the continuous-time domain,” arXiv preprint arXiv:1903.09245, 2019.

[71] K. Truong and D. van Leeuwen, “Automatic detection of laughter,” in 9th European Conference on Speech Communication and Technology, 4 September 2005 through 8 September 2005, Lisbon, pp. 485–488, 2005.

[72] C. A. Bickley and S. Hunnicutt, “Acoustic analysis of laughter,” in Second International Conference on Spoken Language Processing, 1992.

[73] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.

[74] J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, “Progressive neural networks for transfer learning in emotion recognition,” Proc. Interspeech 2017, pp. 1098–1102, 2017.

[75] T. Dang, B. Stasak, Z. Huang, S. Jayawardena, M. Atcheson, M. Hayat, P. Le, V. Sethu, R. Goecke, and J. Epps, “Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in avec 2017,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 27–35.

[76] A. Metallinou, A. Katsamanis, Y. Wang, and S. Narayanan, “Tracking changes in continuous emotion states using body language and prosodic cues,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2288–2291.

[77] G. Der and I. J. Deary, “Age and sex differences in reaction time in adulthood: results from the united kingdom health and lifestyle survey,” Psychology and aging, vol. 21, no. 1, p. 62, 2006.

[78] L. Bao and E. F. Redish, “Concentration analysis: a quantitative assessment of student states,” American Journal of Physics, vol. 69, no. S1, pp. S45–S53, 2001.

[79] H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition in the wild using deep transfer learning and score fusion,” Image and Vision Computing, vol. 65, pp. 66–75, 2017.

[80] F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, and G. Anbarjafari, “Audio-visual emotion recognition in video clips,” IEEE Transactions on Affective Computing, 2017.


[38] [38]

Long short-term memory recurrent neural network architectures for large scale acoustic modeling,

H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014

work page 2014

[39] [39]

Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[40] [40]

Learning deconvolution network for semantic segmentation,

H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 1520–1528

work page 2015

[41] [41]

Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,

V . Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,” IEEE transactions on pattern analysis and machine intelligence , vol. 39, no. 12, pp. 2481–2495, 2017

work page 2017

[42] [42]

Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,

F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud et al., “Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,” in Proceedings of the 2018 on Audio/Visual Emo- tion Challenge and Workshop. ACM, 2018, pp. 3–13

work page 2018

[43] [43]

Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,

C. Wang, P . Lopes, T. Pun, and G. Chanel, “Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 73–81

work page 2018

[44] [44]

Speech-based continuous emotion prediction by learning perception responses related to salient events: a study based on vocal affect bursts and cross-cultural affect in avec 2018,

K. Wataraka Gamage, T. Dang, V . Sethu, J. Epps, and E. Ambikaira- jah, “Speech-based continuous emotion prediction by learning perception responses related to salient events: a study based on vocal affect bursts and cross-cultural affect in avec 2018,” in Pro- ceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 47–55

work page 2018

[45] [45]

Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,

J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 65–72

work page 2018

[46] [46]

Multimodal affective dimension prediction using deep bidirectional long short- term memory recurrent neural networks,

L. He, D. Jiang, L. Yang, E. Pei, P . Wu, and H. Sahli, “Multimodal affective dimension prediction using deep bidirectional long short- term memory recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge . ACM, 2015, pp. 73–80

work page 2015

[47] [47]

Recent develop- ments in opensmile, the munich open-source multimedia feature extractor,

F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent develop- ments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 835–838

work page 2013

[48] [48]

Yaafe, an easy to use and efﬁcient audio feature extraction software

B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, “Yaafe, an easy to use and efﬁcient audio feature extraction software.” in ISMIR, 2010, pp. 441–446

work page 2010

[49] [49]

Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,

K. Brady, Y. Gwon, P . Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . ACM, 2016, pp. 97–104

work page 2016

[50] [50]

Multimodal emotion recognition for AVEC 2016 challenge,

F. Povolny, P . Matejka, M. Hradis, A. Popkov ´a, L. Otrusina, P . Smrz, I. Wood, C. Robin, and L. Lamel, “Multimodal emotion recognition for AVEC 2016 challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . ACM, 2016, pp. 75–82

work page 2016

[51] [51]

The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y. Devillers, J. Epps, P . Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

work page 2016

[52] [52]

Decon- volutional networks,

M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Decon- volutional networks,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2528–2535

work page 2010

[53] [53]

A guide to convolution arithmetic for deep learning

V . Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[54] [54]

Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the ﬁrst challenge,

B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the ﬁrst challenge,” Speech Communication, vol. 53, no. 9-10, pp. 1062–1087, 2011

work page 2011

[55] [55]

Soundnet: learning sound representations from unlabeled video,

Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: learning sound representations from unlabeled video,” in Advances in Neural In- formation Processing Systems, 2016, pp. 892–900

work page 2016

[56] [56]

AVEC 2016: depression, mood, and emotion recognition workshop and challenge,

M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor- res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Au- dio/Visual Emotion Challenge. ACM, 2016, pp. 3–10

work page 2016

[57] [57]

Openear-introducing the munich open-source emotion and affect recognition toolkit,

F. Eyben, M. W ¨ollmer, and B. Schuller, “Openear-introducing the munich open-source emotion and affect recognition toolkit,” in Affective computing and intelligent interaction and workshops, 2009. ACII 2009. 3rd international conference on . IEEE, 2009, pp. 1–6

work page 2009

[58] [58]

Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques

L. Muda, M. Begam, and I. Elamvazuthi, “Voice recogni- tion algorithms using mel frequency cepstral coefﬁcient (mfcc) and dynamic time warping (dtw) techniques,” arXiv preprint arXiv:1003.4083, 2010. KHORRAM et al.: JOINTL Y ALIGNING AND PREDICTING CONTINUOUS EMOTION ANNOTATIONS 15

work page internal anchor Pith review Pith/arXiv arXiv 2010

[59] [59]

Recognition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge

S. Khorram, J. Gideon, M. G. McInnis, and E. M. Provost, “Recognition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge.” in INTERSPEECH, 2016, pp. 1215–1219

work page 2016

[60] [60]

Mel-generalized cepstral analysis-a unified approach to speech spectral estimation,

K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis-a unified approach to speech spectral estimation,” in Third International Conference on Spoken Language Processing, 1994

work page 1994

[61] [61]

Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis,

S. Khorram, H. Sameti, F. Bahmaninezhad, S. King, and T. Drugman, “Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, no. 1, p. 12, 2014

work page 2014

[62] [62]

Soft context clustering for f0 modeling in hmm-based speech synthesis,

S. Khorram, H. Sameti, and S. King, “Soft context clustering for f0 modeling in hmm-based speech synthesis,” EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 1, p. 2, 2015

work page 2015

[63] [63]

Speech synthesis based on gaussian conditional random fields,

S. Khorram, F. Bahmaninezhad, and H. Sameti, “Speech synthesis based on gaussian conditional random fields,” in International Symposium on Artificial Intelligence and Signal Processing. Springer, 2013, pp. 183–193

work page 2013

[64] [64]

Using neutral speech models for emotional speech analysis

C. Busso, S. Lee, and S. S. Narayanan, “Using neutral speech models for emotional speech analysis.” in Interspeech, 2007, pp. 2225–2228

work page 2007

[65] [65]

The priori emotion dataset: Linking mood to emotion detected in-the-wild,

S. Khorram, M. Jaiswal, J. Gideon, M. McInnis, and E.-M. Provost, “The priori emotion dataset: Linking mood to emotion detected in-the-wild,” Proc. Interspeech 2018, pp. 1903–1907, 2018

work page 2018

[66] [66]

The Kaldi speech recognition toolkit,

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2011

work page 2011

[67] [67]

Emotion recognition from spontaneous speech using hidden markov models with deep belief networks,

D. Le and E. M. Provost, “Emotion recognition from spontaneous speech using hidden markov models with deep belief networks,” in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2013, pp. 216–221

work page 2013

[68] [68]

Adam: A Method for Stochastic Optimization

D. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[69] [69]

S. K. Mitra and Y. Kuo, Digital signal processing: a computer-based approach. McGraw-Hill Higher Education, 2006, vol. 2

work page 2006

[70] [70]

Trainable Time Warping: Aligning Time-Series in the Continuous-Time Domain

S. Khorram, M. G. McInnis, and E. M. Provost, “Trainable time warping: Aligning time-series in the continuous-time domain,” arXiv preprint arXiv:1903.09245, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903

[71] [71]

Automatic detection of laughter,

K. Truong and D. van Leeuwen, “Automatic detection of laughter,” in 9th European Conference on Speech Communication and Technology, 4 September 2005 through 8 September 2005, Lisbon, pp. 485-488, 2005

work page 2005

[72] [72]

Acoustic analysis of laughter,

C. A. Bickley and S. Hunnicutt, “Acoustic analysis of laughter,” in Second International Conference on Spoken Language Processing, 1992

work page 1992

[73] [73]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[74] [74]

Progressive neural networks for transfer learning in emotion recognition,

J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, “Progressive neural networks for transfer learning in emotion recognition,” Proc. Interspeech 2017, pp. 1098–1102, 2017

work page 2017

[75] [75]

Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in avec 2017,

T. Dang, B. Stasak, Z. Huang, S. Jayawardena, M. Atcheson, M. Hayat, P. Le, V. Sethu, R. Goecke, and J. Epps, “Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in avec 2017,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 27–35

work page 2017

[76] [76]

Tracking changes in continuous emotion states using body language and prosodic cues,

A. Metallinou, A. Katsamanis, Y. Wang, and S. Narayanan, “Tracking changes in continuous emotion states using body language and prosodic cues,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2288–2291

work page 2011

[77] [77]

Age and sex differences in reaction time in adulthood: results from the united kingdom health and lifestyle survey

G. Der and I. J. Deary, “Age and sex differences in reaction time in adulthood: results from the united kingdom health and lifestyle survey.” Psychology and aging, vol. 21, no. 1, p. 62, 2006

work page 2006

[78] [78]

Concentration analysis: a quantitative assessment of student states,

L. Bao and E. F. Redish, “Concentration analysis: a quantitative assessment of student states,” American Journal of Physics, vol. 69, no. S1, pp. S45–S53, 2001

work page 2001

[79] [79]

Video-based emotion recognition in the wild using deep transfer learning and score fusion,

H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition in the wild using deep transfer learning and score fusion,” Image and Vision Computing, vol. 65, pp. 66–75, 2017

work page 2017

[80] [80]

Audio-visual emotion recognition in video clips,

F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, and G. Anbarjafari, “Audio-visual emotion recognition in video clips,” IEEE Transactions on Affective Computing, 2017

work page 2017