
arxiv: 2305.18640 · v1 · pith:XH43SOJMnew · submitted 2023-05-29 · 📡 eess.AS

Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks

classification 📡 eess.AS
keywords: embeddings, recognition, speaker, speech, approach, emotion, features, input

Speech emotion recognition (SER) has drawn considerable attention because of its applications in diverse fields. A current trend in SER is to use embeddings from pre-trained models (PTMs) as input features to downstream models. However, embeddings from speaker recognition PTMs have received far less attention than other PTM embeddings. To fill this gap and to understand the efficacy of speaker recognition PTM embeddings, we perform a comparative analysis of five PTM embeddings. Among them, x-vector embeddings performed best, possibly because training for speaker recognition leads them to capture various components of speech such as tone and pitch. Our modeling approach, which uses x-vector embeddings and mel-frequency cepstral coefficients (MFCC) as input features, is the most lightweight approach while achieving accuracy comparable to previous state-of-the-art (SOTA) methods on the CREMA-D benchmark.
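
The abstract does not say how the x-vector embedding and the MFCC features are combined, so the following is only a minimal sketch of one common option, not the paper's architecture: average the MFCC frames over time and concatenate the result with the utterance-level embedding. The function names `mean_pool` and `fuse_features`, the 512-dimensional embedding size, and all numeric values are hypothetical placeholders; a real pipeline would obtain both inputs from an x-vector model and an MFCC extractor.

```python
from statistics import fmean


def mean_pool(frames):
    """Average a (num_frames x num_coeffs) list of MFCC frames over time."""
    if not frames:
        raise ValueError("need at least one MFCC frame")
    width = len(frames[0])
    if any(len(frame) != width for frame in frames):
        raise ValueError("all MFCC frames must have the same number of coefficients")
    # zip(*frames) iterates over coefficients, so each column is averaged over time.
    return [fmean(column) for column in zip(*frames)]


def fuse_features(embedding, mfcc_frames):
    """Concatenate an utterance-level embedding with time-averaged MFCCs.

    Assumption: simple concatenation; the paper may fuse the two inputs differently.
    """
    return list(embedding) + mean_pool(mfcc_frames)


# Placeholder inputs (not real data): an assumed 512-dim embedding and
# two MFCC frames with three coefficients each.
xvector = [0.0] * 512
mfcc = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
features = fuse_features(xvector, mfcc)
```

The fused vector would then be passed to a downstream classifier. Mean pooling discards temporal structure, which keeps the input small but may not match what the authors actually do.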


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack

    cs.CR 2026-06 unverdicted novelty 5.0

    KEYAC dataset created; KAN fine-tuning achieves SOTA on acoustic side-channel keystroke recognition from speech representations under zero-shot and partial fine-tuning.

  2. Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack

    cs.CR 2026-06 unverdicted novelty 5.0

    KEYAC dataset benchmarks speech models for keyboard acoustic side-channel attacks, with KAN fine-tuning setting new SOTA by addressing nonlinear feature interactions.