Jointly Aligning and Predicting Continuous Emotion Annotations
Pith reviewed 2026-05-25 01:55 UTC · model grok-4.3
The pith
A convolutional network learns time shifts via delayed sinc layers to jointly align speech signals with delayed continuous emotion labels and predict arousal and valence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The multi-delay sinc network simultaneously aligns the speech signal and emotion labels by implementing an aligner network as a stack of delayed sinc layers, each a time-shifted low-pass sinc filter that learns a single delay through gradient descent; multiple such layers together handle non-stationary delays that depend on the acoustic space, while the overall model predicts dimensional descriptors of emotions.
What carries the argument
The delayed sinc layer: a convolutional layer that implements a time-shifted low-pass sinc filter and uses gradient-based optimization to learn one delay value per layer; stacking them compensates for variable reaction-time delays between speech and labels.
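As a rough illustration of the filter described above, the following sketch builds a windowed low-pass sinc kernel shifted by a chosen delay and applies it to a signal. It is not the authors' implementation: the function names, the Hann window, and the zero-padded convolution are assumptions made for this example, and the delay is passed in as a plain number rather than learned by gradient descent.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def delayed_sinc_kernel(delay_s, cutoff_hz, sample_rate_hz, half_width):
    """Windowed low-pass sinc taps for n = -half_width .. half_width.

    The sinc is shifted by delay_s seconds; the window stays centred at n = 0.
    The Hann window is an assumption of this sketch.
    """
    taps = []
    for n in range(-half_width, half_width + 1):
        shifted = 2.0 * cutoff_hz * (n / sample_rate_hz - delay_s)
        ideal = (2.0 * cutoff_hz / sample_rate_hz) * sinc(shifted)
        window = 0.5 * (1.0 + math.cos(math.pi * n / half_width))
        taps.append(ideal * window)
    return taps


def convolve_same(signal, kernel, half_width):
    """y[t] = sum over n of kernel[n] * signal[t - n], zero-padded at the edges."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, n in enumerate(range(-half_width, half_width + 1)):
            idx = t - n
            if 0 <= idx < len(signal):
                acc += kernel[k] * signal[idx]
        out.append(acc)
    return out
```

With a sample rate of 1, a cutoff of 0.5, and a delay of 3, an impulse placed at sample 20 comes out with its peak at sample 23. In a trainable layer the delay would be a parameter updated from the loss gradient instead of a fixed argument.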
If this is right
- The model reaches state-of-the-art speech-only results on both RECOLA and SEWA by learning time-varying delays.
- End-to-end training becomes possible without a separate alignment preprocessing step.
- Non-stationary delays that vary with acoustic content can be compensated inside the network.
- Dimensional emotion descriptors can be predicted directly from aligned speech features.
Where Pith is reading between the lines
- The same delayed-sinc alignment mechanism could be tested on other time-series annotation tasks that suffer from human reaction delays, such as continuous affect in video or physiological signals.
- If the learned delays prove stable across speakers or sessions, the approach might reduce the amount of manual synchronization required when collecting new emotion datasets.
- Replacing the sinc-based aligner with other differentiable delay mechanisms could be compared directly on the same datasets to isolate the contribution of the low-pass filter property.
Load-bearing premise
The misalignment between the speech signal and the emotion labels can be adequately compensated by a stack of learned time shifts implemented via delayed sinc layers.
What would settle it
On RECOLA or SEWA, a version of the network with the aligner removed or with all delays fixed to zero fails to match or exceed the performance of the full model while still predicting arousal and valence.
Original abstract
Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a multi-delay sinc network, a CNN architecture with convolutional layers followed by an aligner network implemented via delayed sinc layers. Each delayed sinc layer is a time-shifted low-pass sinc filter that learns a single delay through gradient descent. Stacking multiple such layers is claimed to compensate for non-stationary, acoustic-space-dependent delays between speech signals and continuous emotion labels (arousal, valence). The system is evaluated on RECOLA and SEWA datasets and asserted to achieve state-of-the-art speech-only performance by jointly learning alignment and prediction in an end-to-end manner.
Significance. If the empirical claims hold and the alignment mechanism is shown to function as described, the work could meaningfully advance continuous emotion recognition by providing an integrated, differentiable solution to label misalignment arising from human reaction times. The delayed sinc layer introduces a novel, gradient-based time-shift operator that may have utility beyond this domain in other misaligned signal-processing tasks.
major comments (2)
- [Abstract / Aligner network] Abstract and aligner network description: The manuscript states that 'Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.' However, each delayed sinc layer is defined to learn exactly one fixed delay via gradient descent on a time-shifted sinc filter. A convolutional stack therefore applies a fixed collection of constant delays uniformly across time and inputs, with no mechanism described for dynamic selection, per-timestep modulation, or content-dependent routing. This directly undermines the central claim that the architecture models non-stationary, acoustic-dependent misalignments.
- [Experimental evaluation] Experimental results: The abstract asserts state-of-the-art speech-only results on RECOLA and SEWA, yet supplies no numerical scores (e.g., CCC values), baseline comparisons, error bars, statistical tests, or protocol details (train/test splits, cross-validation). Without these, the performance claims cannot be verified and the efficacy of the joint alignment cannot be assessed.
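For readers unfamiliar with the metric the referee asks for, here is a minimal sketch of how a CCC score could be computed, assuming CCC denotes the concordance correlation coefficient commonly reported for continuous emotion prediction. The function name and the use of population (divide-by-n) statistics are choices made for this example, not details taken from the paper.

```python
def concordance_correlation(predictions, labels):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(p, l) / (var(p) + var(l) + (mean(p) - mean(l)) ** 2),
    using population (divide-by-n) variance and covariance.
    """
    if len(predictions) != len(labels) or len(predictions) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_l = sum(labels) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l)
              for p, l in zip(predictions, labels)) / n
    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0.0:
        # Both sequences are the same constant; CCC is undefined in that case.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denom
```

Unlike Pearson correlation, this score drops when the prediction is offset or rescaled relative to the labels: comparing `[1, 2, 3]` with `[2, 3, 4]` gives 4/7 rather than 1.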
minor comments (2)
- [Method] The mathematical formulation of the delayed sinc layer (filter impulse response, gradient computation for the delay parameter) should be provided explicitly with an equation to allow reproduction.
- [Results] Clarify whether the learned delays are visualized or analyzed post-training to confirm they capture reaction-time effects rather than arbitrary shifts.
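To make the first minor comment concrete, the following worked derivation sketches what the requested gradient could look like. It starts from the filter quoted elsewhere on this page, and it assumes the normalized sinc convention, sinc(x) = sin(πx)/(πx), and a window h_w[n] that does not depend on τ. Both are assumptions of this sketch, not statements confirmed by the paper.

```latex
% Filter as quoted on this page, with the sinc argument named x_n:
h_{\mathrm{sinc}}[n;\tau] = \frac{2 f_c}{f_s}\,\operatorname{sinc}(x_n)\,h_w[n],
\qquad x_n = 2 f_c\left(\frac{n}{f_s} - \tau\right)

% Derivative of the normalized sinc:
\frac{d}{dx}\operatorname{sinc}(x) =
\begin{cases}
  \dfrac{\cos(\pi x) - \operatorname{sinc}(x)}{x}, & x \neq 0 \\[6pt]
  0, & x = 0
\end{cases}

% Chain rule with \partial x_n / \partial \tau = -2 f_c:
\frac{\partial h_{\mathrm{sinc}}[n;\tau]}{\partial \tau}
= -\frac{4 f_c^{2}}{f_s}\,
  \frac{\cos(\pi x_n) - \operatorname{sinc}(x_n)}{x_n}\,h_w[n]
\quad (x_n \neq 0)
```

Under these assumptions every filter tap is a smooth function of τ, which is what would allow the delay to be updated by backpropagation like any other weight.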
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify key aspects of our work. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Abstract / Aligner network] Abstract and aligner network description: The manuscript states that 'Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.' However, each delayed sinc layer is defined to learn exactly one fixed delay via gradient descent on a time-shifted sinc filter. A convolutional stack therefore applies a fixed collection of constant delays uniformly across time and inputs, with no mechanism described for dynamic selection, per-timestep modulation, or content-dependent routing. This directly undermines the central claim that the architecture models non-stationary, acoustic-dependent misalignments.
  Authors: We appreciate the referee pointing out this distinction. Each delayed sinc layer learns one fixed delay, resulting in a fixed collection of delays across the stack. Our original phrasing intended to convey that different learned delays can address misalignments that vary with acoustic space or conditions (i.e., different constant delays under different recording environments). However, we agree there is no mechanism for per-timestep or content-dependent adjustment. We will revise the abstract and aligner network section to state more precisely that multiple layers learn a set of fixed delays to handle varying but constant misalignments across acoustic spaces, removing the implication of explicitly non-stationary (time-varying) modeling. These wording changes will be made in the revised manuscript.
  Revision: yes
- Referee: [Experimental evaluation] Experimental results: The abstract asserts state-of-the-art speech-only results on RECOLA and SEWA, yet supplies no numerical scores (e.g., CCC values), baseline comparisons, error bars, statistical tests, or protocol details (train/test splits, cross-validation). Without these, the performance claims cannot be verified and the efficacy of the joint alignment cannot be assessed.
  Authors: We agree that the abstract would be strengthened by including concrete performance metrics. The full manuscript contains the detailed results, including CCC values for arousal and valence, baseline comparisons, and evaluation protocols on RECOLA and SEWA. In the revision, we will update the abstract to summarize key numerical results (e.g., specific CCC scores), note the speech-only SOTA comparisons, and briefly reference the cross-validation protocol. This will make the claims verifiable from the abstract alone while retaining the full details in the experimental section.
  Revision: yes
Circularity Check
No significant circularity; empirical validation on external datasets
The paper introduces a multi-delay sinc network architecture for joint alignment and prediction of continuous emotion labels, with the central claims resting on empirical performance metrics obtained by training and evaluating on the public RECOLA and SEWA datasets. No load-bearing step reduces by construction to a fitted parameter renamed as a prediction, a self-citation chain, or a self-definitional relation; the delayed sinc layers are defined as trainable components whose learned delays are not presupposed in the reported results. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned per-layer delays
axioms (1)
- domain assumption: A time shift between speech and continuous emotion labels can be modeled by a low-pass sinc filter whose delay is learned end-to-end
invented entities (1)
- delayed sinc layer: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArrowOfTime.lean, arrow_from_z (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: The delayed sinc layer … is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: … we introduce the delayed sinc layer … $h_{\mathrm{sinc}}[n;\tau] = \frac{2 f_c}{f_s}\,\operatorname{sinc}\!\left[2 f_c\left(\frac{n}{f_s} - \tau\right)\right] h_w[n]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
An argument for basic emotions,
P . Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992
work page 1992
-
[2]
Wild wild emotion: a multimodal ensemble approach,
J. Gideon, B. Zhang, Z. Aldeneh, Y. Kim, S. Khorram, D. Le, and E. M. Provost, “Wild wild emotion: a multimodal ensemble approach,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 501–505
work page 2016
-
[3]
Universals and cultural variations in 22 emotional ex- pressions across five cultures
D. Cordaro, R. Sun, D. Keltner, S. Kamble, N. Huddar, and G. McNeil, “Universals and cultural variations in 22 emotional ex- pressions across five cultures.” Emotion (Washington, DC), vol. 18, no. 1, pp. 75–93, 2018
work page 2018
-
[4]
Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,
B. W. Schuller, “Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018
work page 2018
-
[5]
Continuous estimation of emotions in speech by dynamic cooperative speaker models,
A. Mencattini, E. Martinelli, F. Ringeval, B. Schuller, and C. Di Na- tale, “Continuous estimation of emotions in speech by dynamic cooperative speaker models,” IEEE transactions on affective comput- ing, vol. 8, no. 3, pp. 314–327, 2017
work page 2017
-
[6]
F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, “Av+ec 2015: the first affect recognition challenge bridging across audio, video, and physiological data,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 3–8
work page 2015
-
[7]
Pool- ing acoustic and lexical features for the prediction of valence,
Z. Aldeneh, S. Khorram, D. Dimitriadis, and E. M. Provost, “Pool- ing acoustic and lexical features for the prediction of valence,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 68–72
work page 2017
-
[8]
End-to-end learning for dimensional emotion recognition from physiological signals,
G. Keren, T. Kirschstein, E. Marchi, F. Ringeval, and B. Schuller, “End-to-end learning for dimensional emotion recognition from physiological signals,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 985–990
work page 2017
-
[9]
Study of dense network ap- proaches for speech emotion recognition,
M. Abdelwahab and C. Busso, “Study of dense network ap- proaches for speech emotion recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, 2018
work page 2018
-
[10]
S. Parthasarathy and C. Busso, “Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve pre- dictions of emotional attributes,” Proc. Interspeech 2018, pp. 3698– 3702, 2018
work page 2018
-
[11]
Exploiting acoustic and lexical properties of phonemes to recognize valence from speech,
B. Zhang, S. Khorram, and E. M. Provost, “Exploiting acoustic and lexical properties of phonemes to recognize valence from speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5871–5875
work page 2019
-
[12]
End-to-end multimodal emotion recognition using deep neural networks,
P . Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017
work page 2017
-
[13]
Summary for avec 2017: real-life depression and affect challenge and workshop,
F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, and M. Pantic, “Summary for avec 2017: real-life depression and affect challenge and workshop,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 1963–1964
work page 2017
-
[14]
J. Chang and S. Scherer, “Learning representations of emotional speech with deep convolutional generative adversarial networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2746–2750
work page 2017
-
[15]
Toward effective automatic recognition systems of emotion in speech,
C. Busso, M. Bulut, S. Narayanan, J. Gratch, and S. Marsella, “Toward effective automatic recognition systems of emotion in speech,” Social emotions in nature and artifact: emotions in human and human-computer interaction, J. Gratch and S. Marsella, Eds , pp. 110–127, 2013
work page 2013
-
[16]
Emotion recognition in human- computer interaction,
R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human- computer interaction,” IEEE Signal processing magazine , vol. 18, no. 1, pp. 32–80, 2001
work page 2001
-
[17]
Describing the emotional states that are expressed in speech,
R. Cowie and R. R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech communication , vol. 40, no. 1-2, pp. 5–32, 2003
work page 2003
-
[18]
Jointly predicting arousal, valence and dominance with multi-task learning,
S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” INTERSPEECH, Stock- holm, Sweden, 2017
work page 2017
-
[19]
B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition with multi-task learning: seeking common ground while preserving differences,” IEEE Transactions on Affec- tive Computing, 2017
work page 2017
-
[20]
Prediction-based learning for continuous emotion recognition in speech,
J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning for continuous emotion recognition in speech,” in Acous- 14 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING tics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5005–5009
work page 2017
-
[21]
Continuous multimodal emotion prediction based on long short term memory recurrent neural network,
J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge . ACM, 2017, pp. 11–18
work page 2017
-
[22]
Z. Huang, T. Dang, N. Cummins, B. Stasak, P . Le, V . Sethu, and J. Epps, “An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015, pp. 41–48
work page 2015
-
[23]
Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,
S. Mariooryad and C. Busso, “Correcting time-continuous emo- tional labels by modeling the reaction lag of evaluators,” IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 97–108, 2015
work page 2015
-
[24]
Robust continuous prediction of human emotions using multi- scale dynamic cues,
J. Nicolle, V . Rapp, K. Bailly, L. Prevost, and M. Chetouani, “Robust continuous prediction of human emotions using multi- scale dynamic cues,” in Proceedings of the 14th ACM international conference on Multimodal interaction. ACM, 2012, pp. 501–508
work page 2012
-
[25]
A history of experimental psychology,
E. B ¨oring, “A history of experimental psychology,” New York: Appleton-Century, 1950
work page 1950
-
[26]
Errors of judgement at greenwich in 1796
J. D. Mollon and A. J. Perkins, “Errors of judgement at greenwich in 1796.” Nature, 1996
work page 1996
-
[27]
On the speed of different senses and nerve transmis- sion by hirsch (1862),
S. Nicolas, “On the speed of different senses and nerve transmis- sion by hirsch (1862),” Psychological Research , vol. 59, no. 4, pp. 261–268, 1997
work page 1997
-
[28]
Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations,
S. Mariooryad and C. Busso, “Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations,” in Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on. IEEE, 2013, pp. 85–90
work page 2013
-
[29]
Automatic segmenta- tion of spontaneous data using dimensional labels from multiple coders,
M. A. Nicolaou, H. Gunes, and M. Pantic, “Automatic segmenta- tion of spontaneous data using dimensional labels from multiple coders,” in Proc. of LREC Int. Workshop on Multimodal Corpora: Ad- vances in Capturing, Coding and Analyzing Multimodality . Citeseer, 2010, pp. 43–48
work page 2010
-
[30]
Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,
G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nico- laou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200–5204
work page 2016
-
[31]
M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous predic- tion of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Transactions on Affective Computing , vol. 2, no. 2, pp. 92–105, 2011
work page 2011
-
[32]
S. Khorram, Z. Aldeneh, D. Dimitriadis, M. McInnis, and E. M. Provost, “Capturing long-term temporal dependencies with con- volutional networks for continuous emotion recognition,” Proc. Interspeech 2017, pp. 1253–1257, 2017
work page 2017
-
[33]
Prediction of asynchronous dimen- sional emotion ratings from audiovisual and physiological data,
F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P . Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller, “Prediction of asynchronous dimen- sional emotion ratings from audiovisual and physiological data,” Pattern Recognition Letters, vol. 66, pp. 22–30, 2015
work page 2015
-
[34]
Discretized continuous speech emotion recognition with multi-task deep recurrent neural network
D. Le, Z. Aldeneh, and E. M. Provost, “Discretized continuous speech emotion recognition with multi-task deep recurrent neural network.” in INTERSPEECH, 2017, pp. 1108–1112
work page 2017
-
[35]
Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,
F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in International Conference and Workshops on Automatic Face and Gesture Recognition (FG) . IEEE, 2013, pp. 1–8
work page 2013
-
[36]
Avec 2017: real-life depression, and affect recognition workshop and chal- lenge,
F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “Avec 2017: real-life depression, and affect recognition workshop and chal- lenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 3–9
work page 2017
-
[37]
Multimodal multi-task learning for dimensional and continuous emotion recognition,
S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 19–26
work page 2017
-
[38]
Long short-term memory recurrent neural network architectures for large scale acoustic modeling,
H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014
work page 2014
-
[39]
Multi-Scale Context Aggregation by Dilated Convolutions
F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[40]
Learning deconvolution network for semantic segmentation,
H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 1520–1528
work page 2015
-
[41]
Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,
V . Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,” IEEE transactions on pattern analysis and machine intelligence , vol. 39, no. 12, pp. 2481–2495, 2017
work page 2017
-
[42]
Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,
F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud et al., “Avec 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition,” in Proceedings of the 2018 on Audio/Visual Emo- tion Challenge and Workshop. ACM, 2018, pp. 3–13
work page 2018
-
[43]
C. Wang, P . Lopes, T. Pun, and G. Chanel, “Towards a better gold standard: denoising and modelling continuous emotion annota- tions based on feature agglomeration and outlier regularisation,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 73–81
work page 2018
-
[44]
K. Wataraka Gamage, T. Dang, V . Sethu, J. Epps, and E. Ambikaira- jah, “Speech-based continuous emotion prediction by learning perception responses related to salient events: a study based on vocal affect bursts and cross-cultural affect in avec 2018,” in Pro- ceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 47–55
work page 2018
-
[45]
Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,
J. Zhao, R. Li, S. Chen, and Q. Jin, “Multi-modal multi-cultural di- mensional continues emotion recognition in dyadic interactions,” in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 65–72
work page 2018
-
[46]
L. He, D. Jiang, L. Yang, E. Pei, P . Wu, and H. Sahli, “Multimodal affective dimension prediction using deep bidirectional long short- term memory recurrent neural networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge . ACM, 2015, pp. 73–80
work page 2015
-
[47]
Recent develop- ments in opensmile, the munich open-source multimedia feature extractor,
F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent develop- ments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 835–838
work page 2013
-
[48]
Yaafe, an easy to use and efficient audio feature extraction software
B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, “Yaafe, an easy to use and efficient audio feature extraction software.” in ISMIR, 2010, pp. 441–446
work page 2010
-
[49]
Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,
K. Brady, Y. Gwon, P . Khorrami, E. Godoy, W. Campbell, C. Dagli, and T. S. Huang, “Multi-modal audio, video and physiological sensor learning for continuous emotion prediction,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . ACM, 2016, pp. 97–104
work page 2016
-
[50]
Multimodal emotion recognition for AVEC 2016 challenge,
F. Povolny, P . Matejka, M. Hradis, A. Popkov ´a, L. Otrusina, P . Smrz, I. Wood, C. Robin, and L. Lamel, “Multimodal emotion recognition for AVEC 2016 challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . ACM, 2016, pp. 75–82
work page 2016
-
[51]
The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,
F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y. Devillers, J. Epps, P . Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016
work page 2016
-
[52]
M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Decon- volutional networks,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2528–2535
work page 2010
-
[53]
A guide to convolution arithmetic for deep learning
V . Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[54]
B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge,” Speech Communication, vol. 53, no. 9-10, pp. 1062–1087, 2011
work page 2011
-
[55]
Soundnet: learning sound representations from unlabeled video,
Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: learning sound representations from unlabeled video,” in Advances in Neural In- formation Processing Systems, 2016, pp. 892–900
work page 2016
-
[56]
AVEC 2016: depression, mood, and emotion recognition workshop and challenge,
M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor- res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “AVEC 2016: depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Au- dio/Visual Emotion Challenge. ACM, 2016, pp. 3–10
work page 2016
-
[57]
Openear-introducing the munich open-source emotion and affect recognition toolkit,
F. Eyben, M. W ¨ollmer, and B. Schuller, “Openear-introducing the munich open-source emotion and affect recognition toolkit,” in Affective computing and intelligent interaction and workshops, 2009. ACII 2009. 3rd international conference on . IEEE, 2009, pp. 1–6
work page 2009
-
[58]
L. Muda, M. Begam, and I. Elamvazuthi, “Voice recogni- tion algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques,” arXiv preprint arXiv:1003.4083, 2010. KHORRAM et al.: JOINTL Y ALIGNING AND PREDICTING CONTINUOUS EMOTION ANNOTATIONS 15
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[59]
Recog- nition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge
S. Khorram, J. Gideon, M. G. McInnis, and E. M. Provost, “Recog- nition of depression in bipolar disorder: Leveraging cohort and person-specific knowledge.” in INTERSPEECH, 2016, pp. 1215– 1219
work page 2016
-
[60]
Mel-generalized cepstral analysis-a unified approach to speech spectral estima- tion,
K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis-a unified approach to speech spectral estima- tion,” in Third International Conference on Spoken Language Process- ing, 1994
work page 1994
-
[61]
S. Khorram, H. Sameti, F. Bahmaninezhad, S. King, and T. Drug- man, “Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthe- sis,” EURASIP Journal on Audio, Speech, and Music Processing , vol. 2014, no. 1, p. 12, 2014
work page 2014
-
[62]
Soft context clustering for f0 modeling in hmm-based speech synthesis,
S. Khorram, H. Sameti, and S. King, “Soft context clustering for f0 modeling in hmm-based speech synthesis,” EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 1, p. 2, 2015
work page 2015
-
[63]
Speech synthesis based on gaussian conditional random fields,
S. Khorram, F. Bahmaninezhad, and H. Sameti, “Speech synthesis based on gaussian conditional random fields,” in International Symposium on Artificial Intelligence and Signal Processing. Springer, 2013, pp. 183–193
work page 2013
-
[64]
Using neutral speech models for emotional speech analysis
C. Busso, S. Lee, and S. S. Narayanan, “Using neutral speech models for emotional speech analysis.” in Interspeech, 2007, pp. 2225–2228
work page 2007
-
[65]
The priori emotion dataset: Linking mood to emotion detected in-the-wild,
S. Khorram, M. Jaiswal, J. Gideon, M. McInnis, and E.-M. Provost, “The priori emotion dataset: Linking mood to emotion detected in-the-wild,” Proc. Interspeech 2018, pp. 1903–1907, 2018
work page 2018
-
[66]
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2011
-
[67]
D. Le and E. M. Provost, “Emotion recognition from spontaneous speech using hidden Markov models with deep belief networks,” in Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2013, pp. 216–221
-
[68]
D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
-
[69]
S. K. Mitra and Y. Kuo, Digital signal processing: a computer-based approach. McGraw-Hill Higher Education, 2006, vol. 2
-
[70]
S. Khorram, M. G. McInnis, and E. M. Provost, “Trainable time warping: Aligning time-series in the continuous-time domain,” arXiv preprint arXiv:1903.09245, 2019
-
[71]
K. Truong and D. van Leeuwen, “Automatic detection of laughter,” in 9th European Conference on Speech Communication and Technology, Lisbon, 4 September 2005 through 8 September 2005, pp. 485–488, 2005
-
[72]
C. A. Bickley and S. Hunnicutt, “Acoustic analysis of laughter,” in Second International Conference on Spoken Language Processing, 1992
-
[73]
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016
-
[74]
J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, “Progressive neural networks for transfer learning in emotion recognition,” Proc. Interspeech 2017, pp. 1098–1102, 2017
-
[75]
T. Dang, B. Stasak, Z. Huang, S. Jayawardena, M. Atcheson, M. Hayat, P. Le, V. Sethu, R. Goecke, and J. Epps, “Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017, pp. 27–35
-
[76]
A. Metallinou, A. Katsamanis, Y. Wang, and S. Narayanan, “Tracking changes in continuous emotion states using body language and prosodic cues,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2288–2291
-
[77]
G. Der and I. J. Deary, “Age and sex differences in reaction time in adulthood: results from the United Kingdom Health and Lifestyle Survey,” Psychology and Aging, vol. 21, no. 1, p. 62, 2006
-
[78]
L. Bao and E. F. Redish, “Concentration analysis: a quantitative assessment of student states,” American Journal of Physics, vol. 69, no. S1, pp. S45–S53, 2001
-
[79]
H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition in the wild using deep transfer learning and score fusion,” Image and Vision Computing, vol. 65, pp. 66–75, 2017
-
[80]
F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, and G. Anbarjafari, “Audio-visual emotion recognition in video clips,” IEEE Transactions on Affective Computing, 2017