Toward using Speech to Sense Student Emotion in Remote Learning Environments

André Mayoraz; Bogdan Vlasenko; Egon Werlen; Mathew Magimai.-Doss; Per Bergamin; Sargam Vyas

arXiv: 2604.09881 · v1 · submitted 2026-04-10 · eess.AS · cs.HC


Sargam Vyas, Bogdan Vlasenko, André Mayoraz, Egon Werlen, Per Bergamin, Mathew Magimai.-Doss

Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3


keywords: remote learning, speech emotion, self-control tasks, dimensional emotion, valence arousal dominance, spontaneous speech, paralinguistics


The pith

Speech from self-control tasks can indicate students' emotions during remote learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that speech collected through self-control tasks exhibits emotional variations that can be used to sense how students feel in remote learning settings. Researchers recorded spontaneous monologues as responses to these tasks and ran both human listening tests and machine prediction models for the dimensions of valence, arousal, and dominance. This matters because remote education lacks the emotional signals available in classrooms, so a speech-based method could help make learning more responsive. The results indicate that the tasks do produce usable emotional information from speech.

Core claim

The central claim is that speech acquired through self-control tasks shows perceptible variation along valence, arousal, and dominance dimensions, and that these variations can be automatically predicted from the speech signals.

What carries the argument

A newly developed dataset of spontaneous monologue speech obtained as open responses to self-control tasks, used to study dimensional emotion prediction.

If this is right

Speech-based emotion sensing supports instructional design adjustments in remote learning.
Feedback generation can incorporate paralinguistic information from self-control task speech.
This integration enhances learning experiences in asynchronous environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

One could test whether using these tasks actually leads to improved learning outcomes or student satisfaction.
The approach might extend to other asynchronous communication scenarios beyond education.

Load-bearing premise

Speech variations observed in the tasks mainly come from changes in emotional state instead of task wording, personal speaking habits, or how the audio was recorded.

What would settle it

A follow-up experiment with a larger and more varied group of students that measures emotions independently, for example with questionnaires right after the tasks, and finds no matching pattern in the speech data would show the approach does not work as claimed.

Figures

Figures reproduced from arXiv: 2604.09881 by André Mayoraz, Bogdan Vlasenko, Egon Werlen, Mathew Magimai.-Doss, Per Bergamin, Sargam Vyas.

**Figure 3.** Self-Assessment Manikin for valence, arousal, dominance ranges with …

**Figure 4.** Distribution of aggregated EWE labels for SPOT-ED database across …

**Figure 5.** Distribution of aggregated EWE labels for VAM database across …
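
The captions for Figures 4 and 5 mention aggregated EWE labels. Assuming EWE here stands for the evaluator weighted estimator often used with the VAM corpus, the aggregation can be sketched as follows: each evaluator's rating is weighted by how strongly that evaluator correlates with the mean rating over all evaluators. This is an illustrative sketch and not the authors' code; the function names `pearson` and `ewe`, the clipping of negative weights to zero, and the fallback to a plain mean are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def ewe(ratings):
    """Aggregate ratings for one emotion dimension.

    ratings[k][n] is the rating given by evaluator k to utterance n.
    Returns one aggregated label per utterance.
    """
    n_eval = len(ratings)
    n_utt = len(ratings[0])
    # Plain mean over evaluators, used as the reference for the weights.
    mean_rating = [sum(r[n] for r in ratings) / n_eval for n in range(n_utt)]
    # Weight = agreement with the mean; negative agreement is clipped (assumption).
    weights = [max(pearson(r, mean_rating), 0.0) for r in ratings]
    total = sum(weights)
    if total == 0:
        # No evaluator agrees with the mean: fall back to the plain mean (assumption).
        return mean_rating
    return [
        sum(w * r[n] for w, r in zip(weights, ratings)) / total
        for n in range(n_utt)
    ]
```

With this weighting, an evaluator whose ratings run against the consensus contributes little or nothing to the final label, which is the usual motivation for preferring it over a plain mean.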

Original abstract

With advancements in multimodal communication technologies, remote learning environments such as distance universities are increasing. Remote learning typically happens asynchronously. As a consequence, unlike face-to-face in-person classroom teaching, it lacks sufficient emotional cues for making learning a pleasant experience. Motivated by advances made in the paralinguistic speech processing community on emotion prediction, in this paper we explore the use of speech for sensing students' emotions by building upon speech-based self-control tasks developed to aid effective remote learning. More precisely, we investigate: (a) whether speech acquired through self-control tasks exhibits perceptible variation along valence, arousal, and dominance dimensions, and (b) whether those dimensional emotion variations can be automatically predicted. We address these two research questions by developing a dataset containing spontaneous monologue speech acquired as open responses to self-control tasks and by carrying out subjective listener evaluations and automatic dimensional emotion prediction studies on that dataset. Our investigations indicate that speech-based self-control tasks can be a means to sense student emotion in remote learning environments. This opens potential avenues to seamlessly integrate paralinguistic speech processing technologies in the remote learning loop for enhancing learning experiences through instructional design and feedback generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper introduces a new dataset of spontaneous speech from self-control tasks but does not isolate emotion from content or speaker style.


The paper collects spontaneous monologue speech as open responses to self-control tasks in a remote learning setting, then runs listener ratings along valence-arousal-dominance and some standard automatic prediction experiments. The central finding is that perceptible dimensional variation exists and can be modeled to a degree, which they tie to the possibility of sensing student emotion without extra prompts. What is actually new is the dataset itself, built around those specific tasks rather than general or acted speech, plus the direct application to asynchronous remote learning. That is a straightforward extension of existing paralinguistic work into an applied context, and the dual subjective-plus-automatic design is a sensible first step. The motivation for embedding this into existing learning activities is also clear and practical.

The soft spot is the attribution step. The claim that speech-based self-control tasks can sense emotion requires that the observed VAD differences stem primarily from emotional state rather than the linguistic content of the answers, individual speaking habits, or recording conditions. The abstract gives no indication of matched neutral tasks, content-independent features, or speaker normalization that would support that separation, so the results could be explained by non-emotional factors. This makes the generalization to instructional design and feedback generation preliminary rather than demonstrated.

The work is aimed at people in speech processing or educational technology who are looking for new applied datasets or initial explorations in remote learning. A reader interested in paralinguistic methods for education could extract the dataset and basic setup, but anyone needing robust evidence for emotion sensing would find the current support thin.

It deserves peer review because the dataset is original and the questions are well-posed; referees can check the experimental controls, sample details, and numbers that are missing from the abstract.

Referee Report

2 major / 2 minor

Summary. The paper explores the use of speech collected via self-control tasks in remote learning environments to sense student emotions. It develops a new dataset of spontaneous monologue speech as open responses to these tasks, conducts subjective listener evaluations of valence-arousal-dominance (VAD) variations, and performs automatic dimensional emotion prediction experiments. The central claim is that speech-based self-control tasks can serve as a means to sense student emotion, enabling integration of paralinguistic technologies for instructional design and feedback in asynchronous remote learning.

Significance. If the empirical results hold with proper controls and generalization, the work could open avenues for embedding speech emotion recognition into educational platforms to address the lack of emotional cues in remote settings. The approach builds on established paralinguistic methods and introduces a task-specific dataset, which has potential for practical impact in distance education if the attribution to emotion (rather than confounds) is convincingly demonstrated.

major comments (2)

[Dataset collection and experimental design sections] The manuscript provides no description of controls to isolate emotional state from linguistic content of the open responses, individual speaking styles, or recording conditions (e.g., no matched neutral vs. emotional prompts, content-independent features, or speaker-normalized baselines). This is load-bearing for the central claim that observed VAD variations and predictions arise primarily from emotion, as subjective ratings and automatic predictions could succeed for non-emotional reasons.
[Subjective listener evaluations and automatic prediction studies] No sample sizes, participant demographics, number of tasks/responses, or statistical tests (e.g., significance of VAD differences or prediction performance metrics like CCC, RMSE with baselines) are reported. This prevents verification of whether the subjective evaluations and automatic predictions support the generalization that self-control tasks can sense student emotion.
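
The second major comment names CCC and RMSE with baselines as the kind of reporting that is missing. As a minimal sketch of what that reporting involves, the snippet below computes both metrics for one emotion dimension and defines a trivial mean-predictor baseline. It uses the standard definition of Lin's concordance correlation coefficient with population variances; the function names and the choice of a training-mean baseline are assumptions for illustration, not details taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred)^2),
    using population (divide-by-n) variance and covariance.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    # CCC is undefined when both sequences are the same constant; return 0.0 by convention.
    return 2 * cov / denom if denom > 0 else 0.0


def mean_baseline(train_labels, n_test):
    """Trivial baseline (assumption): always predict the training-set mean."""
    m = sum(train_labels) / len(train_labels)
    return [m] * n_test
```

A constant predictor such as `mean_baseline` always scores a CCC of zero, so it mainly anchors the RMSE column; a reported model would need to beat it on RMSE and show a clearly positive CCC for each of valence, arousal, and dominance.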

minor comments (2)

[Introduction] The abstract and introduction could more clearly distinguish the self-control tasks from standard emotion elicitation protocols to highlight novelty.
[Automatic prediction experiments] Notation for VAD dimensions and any feature sets used in prediction should be defined consistently if not already standard.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the experimental design description and statistical reporting, which we address point by point below. We have revised the manuscript to incorporate the requested details and clarifications where feasible.


Referee: [Dataset collection and experimental design sections] The manuscript provides no description of controls to isolate emotional state from linguistic content of the open responses, individual speaking styles, or recording conditions (e.g., no matched neutral vs. emotional prompts, content-independent features, or speaker-normalized baselines). This is load-bearing for the central claim that observed VAD variations and predictions arise primarily from emotion, as subjective ratings and automatic predictions could succeed for non-emotional reasons.

Authors: We agree that explicit discussion of controls is necessary to support attribution of VAD variations to emotion rather than confounds. The self-control tasks were intentionally open-ended to capture spontaneous emotional responses tied to the remote learning context, but the original manuscript did not detail mitigation strategies. In the revision, we have added a subsection describing the use of prosodic and spectral features that are relatively content-independent, along with speaker-level normalization applied during feature extraction to account for individual styles. Recording conditions were standardized across participants via the same remote platform and equipment guidelines. We acknowledge that matched neutral prompts were not included and have added this as an explicit limitation with suggestions for future controlled studies. These changes better ground the central claim without altering the core findings.

revision: yes

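
The simulated response mentions speaker-level normalization without saying which form it takes. One common choice, assumed here purely for illustration, is per-speaker z-scoring: every feature dimension is centered and scaled using only that speaker's own utterances, so that habitual pitch or loudness differences between speakers are removed before modeling. The function name `speaker_znorm`, the `eps` guard, and the data layout are hypothetical.

```python
import math
from collections import defaultdict


def speaker_znorm(features, speaker_ids, eps=1e-8):
    """Z-normalize each feature dimension within each speaker.

    features: list of equal-length feature vectors, one per utterance.
    speaker_ids: list of speaker labels, aligned with `features`.
    Returns a new list of normalized vectors in the original utterance order.
    """
    # Group utterance indices by speaker.
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)

    normalized = [None] * len(features)
    for idxs in by_speaker.values():
        dim = len(features[idxs[0]])
        count = len(idxs)
        # Per-speaker mean and population standard deviation for each dimension.
        means = [sum(features[i][d] for i in idxs) / count for d in range(dim)]
        stds = [
            math.sqrt(sum((features[i][d] - means[d]) ** 2 for i in idxs) / count)
            for d in range(dim)
        ]
        for i in idxs:
            # eps avoids division by zero when a speaker's feature is constant.
            normalized[i] = [
                (features[i][d] - means[d]) / (stds[d] + eps) for d in range(dim)
            ]
    return normalized
```

Note that this kind of normalization needs several utterances per speaker and removes speaker-level offsets only; it does not address the linguistic-content confound raised by the referee.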
Referee: [Subjective listener evaluations and automatic prediction studies] No sample sizes, participant demographics, number of tasks/responses, or statistical tests (e.g., significance of VAD differences or prediction performance metrics like CCC, RMSE with baselines) are reported. This prevents verification of whether the subjective evaluations and automatic predictions support the generalization that self-control tasks can sense student emotion.

Authors: We apologize for these omissions in the initial version, which are critical for reproducibility and assessment of the results. The revised manuscript now includes the full details: sample sizes and demographics for listeners and speakers, the total number of tasks and responses collected, and the results of statistical tests (including p-values for VAD differences). Automatic prediction results are expanded with CCC, RMSE, and other metrics, plus comparisons against baseline models. These additions directly enable verification and strengthen support for the generalization regarding self-control tasks in remote learning.

revision: yes

Circularity Check

0 steps flagged

Empirical study with no circular derivations or self-referential reductions


The paper describes an empirical investigation that collects a new dataset of spontaneous monologue speech from self-control tasks, conducts subjective listener evaluations on valence-arousal-dominance dimensions, and runs standard automatic prediction experiments. No equations, derivations, fitted parameters, or self-citations are presented that reduce the central claims (perceptible variation and automatic predictability) to tautologies or inputs by construction. The results rest on new data and external evaluation protocols rather than any self-definitional or load-bearing circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on the standard domain assumption that paralinguistic speech features carry emotional information.

axioms (1)

domain assumption: Speech contains paralinguistic cues that vary with emotional valence, arousal, and dominance
Invoked by reference to advances in the paralinguistic speech processing community.



Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages

[1]

Emotions as drivers of learning and cognitive development,

R. Pekrun, “Emotions as drivers of learning and cognitive development,” inNew perspectives on affect and learning technologies. Springer, 2011, pp. 23–39

work page 2011
[2]

Emotions and learning: cognitive theoretical and methodological approaches to studying the influence of emotions on learning,

M. McConnell and K. W. Eva, “Emotions and learning: cognitive theoretical and methodological approaches to studying the influence of emotions on learning,”Researching medical education, pp. 279–290, 2022

work page 2022
[3]

Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends,

B. W. Schuller, “Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends,”Communications of the ACM, 2018

work page 2018
[4]

Adaptive motivation and emotion in education: Research and principles for instructional design,

L. Linnenbrink-Garcia, E. A. Patall, and R. Pekrun, “Adaptive motivation and emotion in education: Research and principles for instructional design,”Policy Insights from the Behavioral and Brain Sciences, vol. 3, no. 2, pp. 228–236, 2016

work page 2016
[5]

Designing instructional technology from an emotional perspective,

H. Astleitner and D. Leutner, “Designing instructional technology from an emotional perspective,”Journal of research on computing in educa- tion, vol. 32, no. 4, pp. 497–510, 2000

work page 2000
[6]

Kansei engineering concept in instructional design. a novel perspective in guiding the design of instructional materials,

K. M. Chuah, C. J. Chen, and C. S. Teh, “Kansei engineering concept in instructional design. a novel perspective in guiding the design of instructional materials,” inFifth International Cyberspace Conference on Ergonomics, 2008, pp. 1–8

work page 2008
[7]

Empathic design: Imagining the cognitive and emotional learner experience,

M. W. Tracey and A. Hutchinson, “Empathic design: Imagining the cognitive and emotional learner experience,”Educational Technology Research and Development, vol. 67, pp. 1259–1272, 2019

work page 2019
[8]

Exploring the effects of emotional design and emotional intelligence in multimedia- based learning: an engineering educational perspective,

J. A. Kumar, B. Muniandy, and W. A. J. Wan Yahaya, “Exploring the effects of emotional design and emotional intelligence in multimedia- based learning: an engineering educational perspective,”New Review of Hypermedia and Multimedia, vol. 25, no. 1-2, pp. 57–86, 2019

work page 2019
[9]

Promoting students’ emotions and achievement–instructional design and evaluation of the ecole-approach,

M. Gl ¨aser-Zikuda, S. Fuß, M. Laukenmann, K. Metz, and C. Randler, “Promoting students’ emotions and achievement–instructional design and evaluation of the ecole-approach,”Learning and instruction, vol. 15, no. 5, pp. 481–495, 2005

work page 2005
[10]

Oral presentation competence in virtual reality,

C. Hou, “Oral presentation competence in virtual reality,” 10.17615/d5ce-bs51, 2022

work page doi:10.17615/d5ce-bs51 2022
[11]

Positive emotion elicitation in chat-based dialogue systems,

N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura, “Positive emotion elicitation in chat-based dialogue systems,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 4, pp. 866–877, 2019

work page 2019
[12]

Visual design as a holistic experience: How students’ emotional responses to the visual design of instructional materials are formed,

K. Tomita, “Visual design as a holistic experience: How students’ emotional responses to the visual design of instructional materials are formed,”Educational technology research and development, vol. 70, no. 2, pp. 469–502, 2022

work page 2022
[13]

Socio-emotional orientation as a mediating variable in the teaching-learning interaction: Implications for instructional design,

S. Jarvel ¨a, E. Lehtinen, and P. Salonen, “Socio-emotional orientation as a mediating variable in the teaching-learning interaction: Implications for instructional design,”Scandinavian journal of educational research, vol. 44, no. 3, pp. 293–306, 2000. 10

work page 2000
[14]

Experiences with speech input in open questions: A digital innovation to foster self-directed learning,

E. Werlen and D. J. Laubscher, “Experiences with speech input in open questions: A digital innovation to foster self-directed learning,” in Empowering self-directed learner through digital innovation. AOSIS, 2025

work page 2025
[15]

Squeezing lemons - emotions in impersonal non-fiction texts. emotional analyses of students’ texts,

E. Werlen, T. Panar, B. Parsaeifard, and C. Imhof, “Squeezing lemons - emotions in impersonal non-fiction texts. emotional analyses of students’ texts,” inEARLI SIG27 Conference 2022: Online measures at the crossroad of ethical and methodological challenges. Southampton, UK, 30 August – 1 September 2022., 2022

work page 2022
[16]

Multi-modal emotion recognition from speech and text,

Z.-J. Chuang and C.-H. Wu, “Multi-modal emotion recognition from speech and text,” inInternational Journal of Computational Linguistics & Chinese Language Processing, V olume 9, Number 2, August 2004: Special Issue on New Trends of Speech and Language Processing, 2004, pp. 45–62

work page 2004
[17]

Emotion Recognition and Generation: A Comprehensive Review of Face, Speech, and Text Modalities.arXiv preprint arXiv:2502.06803

R. Mobbs, D. Makris, and V . Argyriou, “Emotion recognition and gen- eration: A comprehensive review of face, speech, and text modalities,” arXiv preprint arXiv:2502.06803, 2025

work page arXiv 2025
[18]

Desperately seeking emotions or: Actors, wizards, and human beings,

A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. N ¨oth, “Desperately seeking emotions or: Actors, wizards, and human beings,” inISCA tutorial and research workshop (ITRW) on speech and emotion, 2000

work page 2000
[19]

Categorical and dimensional affect analysis in continuous input: Current trends and future directions,

H. Gunes and B. Schuller, “Categorical and dimensional affect analysis in continuous input: Current trends and future directions,”Image and Vision Computing, vol. 31, no. 2, pp. 120–136, 2013

work page 2013
[20]

A database of German emotional speech

F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weisset al., “A database of German emotional speech.” inInterspeech, vol. 5, 2005, pp. 1517–1520

work page 2005
[21]

Ekman,Are there basic emotions?American Psychological Associ- ation, 1992

P. Ekman,Are there basic emotions?American Psychological Associ- ation, 1992

work page 1992
[22]

SmartKom: Multimodal communication with a life-like character,

W. Wahlster, N. Reithinger, and A. Blocher, “SmartKom: Multimodal communication with a life-like character,” inSeventh European Confer- ence on Speech Communication and Technology, 2001

work page 2001
[23]

Universal methods of design: 100 ways to research complex problems,

B. Martin, B. Hanington, and B. M. Hanington, “Universal methods of design: 100 ways to research complex problems,”Develop Innovative Ideas, and Design Effective Solutions, pp. 12–13, 2012

work page 2012
[24]

Heading toward to the natural way of human-machine interaction: the NIMITEK project,

B. Vlasenko and A. Wendemuth, “Heading toward to the natural way of human-machine interaction: the NIMITEK project,” in2009 IEEE International Conference on Multimedia and Expo. IEEE, 2009, pp. 950–953

work page 2009
[25]

Social functions of emotions at four levels of analysis,

D. Keltner and J. Haidt, “Social functions of emotions at four levels of analysis,”Cognition & Emotion, vol. 13, no. 5, pp. 505–521, 1999

work page 1999
[26]

Interper- sonal deception: IV . Effects of suspicion on perceived communication and nonverbal behavior dynamics,

J. K. Burgoon, D. B. Buller, L. Dillman, and J. B. Walther, “Interper- sonal deception: IV . Effects of suspicion on perceived communication and nonverbal behavior dynamics,”Human Communication Research, vol. 22, no. 2, pp. 163–196, 1995

work page 1995
[27]

Exploring cross-modality affective re- actions for audiovisual emotion recognition,

S. Mariooryad and C. Busso, “Exploring cross-modality affective re- actions for audiovisual emotion recognition,”IEEE Transactions on affective computing, vol. 4, no. 2, pp. 183–196, 2013

work page 2013
[28]

Online and face-to-face classroom multitasking and academic performance: Moderated mediation with self-efficacy for self-regulated learning and gender,

A. Alghamdi, A. C. Karpinski, A. Lepp, and J. Barkley, “Online and face-to-face classroom multitasking and academic performance: Moderated mediation with self-efficacy for self-regulated learning and gender,”Computers in Human Behavior, vol. 102, pp. 214–222, 2020

work page 2020
[29]

Constructing a spo- ken dialogue corpus for studying paralinguistic information in expressive conversation and analyzing its statistical/acoustic characteristics,

H. Mori, T. Satake, M. Nakamura, and H. Kasuya, “Constructing a spo- ken dialogue corpus for studying paralinguistic information in expressive conversation and analyzing its statistical/acoustic characteristics,”Speech Communication, vol. 53, no. 1, pp. 36–50, 2011

work page 2011
[30]

Comparing the roles of textual, acoustic and spoken-language features on spontaneous-conversation summarization,

X. Zhu and G. Penn, “Comparing the roles of textual, acoustic and spoken-language features on spontaneous-conversation summarization,” inProceedings of the Human Language Technology Conference of the NAACL, Companion V olume: Short Papers, 2006, pp. 197–200

work page 2006
[31]

An argument for basic emotions,

P. Ekman, “An argument for basic emotions,”Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992

work page 1992
[32]

Evidence for a three-factor theory of emotions,

J. A. Russell and A. Mehrabian, “Evidence for a three-factor theory of emotions,”Journal of research in Personality, vol. 11, no. 3, pp. 273– 294, 1977

work page 1977
[33]

The vera am mittag german audio-visual emotional speech database,

M. Grimmet al., “The vera am mittag german audio-visual emotional speech database,” inProc. ICME. IEEE, 2008, pp. 865–868

work page 2008
[34]

Iemocap: Interactive emotional dyadic motion capture database,

C. Bussoet al., “Iemocap: Interactive emotional dyadic motion capture database,”Language resources and evaluation, 2008

work page 2008
[35]

Introducing the recola multimodal corpus of remote collaborative and affective interac- tions,

F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the recola multimodal corpus of remote collaborative and affective interac- tions,” in2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1–8

work page 2013
[36]

’FEELTRACE’: An instrument for recording per- ceived emotion in real time,

R. Cowie, E. Douglas-Cowie, S. Savvidou*, E. McMahon, M. Sawey, and M. Schr ¨oder, “’FEELTRACE’: An instrument for recording per- ceived emotion in real time,” inISCA tutorial and research workshop (ITRW) on speech and emotion, 2000

work page 2000
[37]

Gtrace: General trace program compatible with EmotionML,

R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Sta- pleton, “Gtrace: General trace program compatible with EmotionML,” in2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 709–710

work page 2013
[38]

MSP-IMPROV: An acted corpus of dyadic inter- actions to study emotion perception,

C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, “MSP-IMPROV: An acted corpus of dyadic inter- actions to study emotion perception,”IEEE Transactions on Affective Computing, 2017

work page 2017
[39]

Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,

R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,”IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, October-December 2019

work page 2019
[40]

UMEME: University of michigan emotional mcgurk effect data set,

E. M. Provost, Y . Shangguan, and C. Busso, “UMEME: University of michigan emotional mcgurk effect data set,”IEEE Transactions on Affective Computing, vol. 6, no. 4, pp. 395–409, 2015

work page 2015
[41]

Crema-d: Crowd-sourced emotional multimodal actors dataset,

H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,”IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014

work page 2014
[42]

Tarasov, S

A. Tarasov, S. J. Delany, and C. Cullen,Using crowdsourcing for labelling emotional speech assets. Technological University Dublin, 2010

work page 2010
[43]

Cheap and fast–but is it good? evaluating non-expert annotations for natural language tasks,

R. Snow, B. O’connor, D. Jurafsky, and A. Y . Ng, “Cheap and fast–but is it good? evaluating non-expert annotations for natural language tasks,” inProceedings of the 2008 conference on empirical methods in natural language processing, 2008, pp. 254–263

work page 2008
[44]

Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,

R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,”IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, 2017

work page 2017
[45]

Salman, Wei-Cheng Lin, and others

C. Busso, R. Lotfian, K. Sridhar, A. N. Salman, W.-C. Lin, L. Goncalves, S. Parthasarathy, A. R. Naini, S.-G. Leem, L. Martinez-Lucaset al., “The msp-podcast corpus,”arXiv preprint arXiv:2509.09791, 2025

work page arXiv 2025
[46]

Imaps: A smart phone based real-time framework for prediction of affect in natural dyadic conversation,

A. M. Rahman, M. I. Tanveer, A. I. Anam, and M. Yeasin, “Imaps: A smart phone based real-time framework for prediction of affect in natural dyadic conversation,” in2012 Visual Communications and Image Processing. IEEE, 2012, pp. 1–6

work page 2012
[47]

On the acoustics of emotion in audio: what speech, music, and sound have in common,

F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer, “On the acoustics of emotion in audio: what speech, music, and sound have in common,”Frontiers in psychology, vol. 4, p. 292, 2013

work page 2013
[48]

Audio-visual feature- decision level fusion for spontaneous emotion estimation in speech conversations,

A. Sayedelahl, R. Araujo, and M. S. Kamel, “Audio-visual feature- decision level fusion for spontaneous emotion estimation in speech conversations,” in2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE, 2013, pp. 1–6

work page 2013
[49]

Jointly predicting arousal, valence and dominance with multi-task learning

S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning.” inInterspeech, vol. 2017, 2017, pp. 1103–1107

work page 2017
[50]

Towards robust speech emotion recognition using deep resid- ual networks for speech enhancement,

A. Triantafyllopoulos, G. Keren, J. Wagner, I. Steiner, and B. W. Schuller, “Towards robust speech emotion recognition using deep resid- ual networks for speech enhancement,” inInterspeech 2019, 2019, pp. 1691–1695

work page 2019
[51]

Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap,

J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 745– 10 759, 2023

work page 2023
[52]

Comparing data-driven and handcrafted features for dimensional emotion recognition,

B. Vlasenko, S. Vyas, and M. M. Doss, “Comparing data-driven and handcrafted features for dimensional emotion recognition,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11 841–11 845

work page 2024
[53]

Opel: Online prompting in elearning. a new tool to foster skills and knowledge,

E. Werlen and P. Bergamin, “Opel: Online prompting in elearning. a new tool to foster skills and knowledge,” inEuropean Conference on Education Research - ECER 2014: ”The Past, the Present and Future of Educational Research in Europe”. EERA, 2014

work page 2014
[54]

Self-control tasks with self-explanation prompts as a component of self-directed online learning,

E. Werlen, V . Mirata, D. Jagals, and N. Bergamin, “Self-control tasks with self-explanation prompts as a component of self-directed online learning,” inBlended learning environments to foster selfdirected learn- ing.AOSIS, 2021, pp. 199–227

work page 2021
[55]

The Kaldi speech recognition toolkit,

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y . Qian, P. Schwarzet al., “The Kaldi speech recognition toolkit,” inIEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011

[56] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using Kaldi,” in Proc. of Interspeech, 2017.

[57] S. Gonzalez, J. Grama, and C. E. Travis, “Comparing the performance of forced aligners used in sociophonetic research,” Linguistics Vanguard, vol. 6, no. 1, p. 20190058, 2020.

[58] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019.

[59] O. Guhr, A.-K. Schumann, F. Bahrmann, and H. J. Böhme, “Training a broad-coverage german sentiment classification model for dialog systems,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 1627–1632.

[60] M. M. Bradley and P. J. Lang, “Measuring emotion: The self-assessment manikin and the semantic differential,” Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0005791694900639

[61] C. Sainz-de Baranda Andujar, L. Gutiérrez-Martín, J. Á. Miranda-Calero, M. Blanco-Ruiz, and C. López-Ongil, “Gender biases in the training methods of affective computing: Redesign and validation of the self-assessment manikin in measuring emotions via audiovisual clips,” Frontiers in psychology, vol. 13, p. 955530, 2022.

[62] M. Grimm and K. Kroschel, “Evaluation of natural emotions using self assessment manikins,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pp. 381–385.

[63] M. Grimm et al., “Primitives-based evaluation and estimation of emotions in speech,” Speech communication, vol. 49, no. 10-11, pp. 787–800, 2007.

[64] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., “The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” in Proceedings INTERSPEECH 2013, Lyon, France, 2013.

[65] F. Eyben et al., “Opensmile: the Munich versatile and fast open-source audio feature extractor,” in Proc. ACM MM, 2010, pp. 1459–1462.

[66] S. wen Yang et al., “SUPERB: Speech processing Universal PERformance Benchmark,” in Proc. Interspeech, 2021, pp. 1194–1198.

[67] W.-N. Hsu et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.

[68] S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1505–1518, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:239885872

[69] J. Wagner et al., “Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0.” [Online]. Available: https://doi.org/10.5281/zenodo.6221127

