Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

Oussama Mustapha Benouddane; Samiya Silarbi; Youcef Soufiane Gheffari

arXiv: 2604.07357 · v1 · submitted 2026-03-28 · cs.CL · cs.AI · cs.SD

Pith reviewed 2026-05-14 22:22 UTC · model grok-4.3

keywords: Arabic speech emotion recognition · hybrid CNN-Transformer · Mel-spectrogram · EYASE corpus · attention mechanisms · low-resource language · temporal modeling

The pith

A hybrid CNN-Transformer model reaches 97.8% accuracy for Arabic speech emotion recognition on the EYASE corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a hybrid architecture that pairs convolutional layers with Transformer encoders to recognize emotions from Arabic speech. Convolutional layers pull local spectral patterns out of Mel-spectrogram inputs while the Transformer component models longer temporal sequences across the utterance. Experiments on the EYASE Egyptian Arabic corpus produce 97.8% accuracy and a 0.98 macro F1-score. Readers would care because Arabic has far fewer annotated emotion datasets than English or German, so a working high-accuracy system demonstrates that attention-based hybrids can still deliver strong results in low-resource language settings.

Core claim

The central claim is that convolutional layers extract discriminative spectral features from Mel-spectrograms while Transformer encoders capture long-range temporal dependencies, and this combination yields 97.8% accuracy together with a macro F1-score of 0.98 when tested on the EYASE corpus of Egyptian Arabic emotional speech.
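Since the headline numbers are accuracy and macro F1, it may help to see how both are derived from a confusion matrix. The sketch below is a minimal illustration only: the function names are hypothetical and the 3-class matrix is made up for the example, not taken from the paper.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def macro_f1(cm):
    """Unweighted mean of per-class F1 (rows = true labels, columns = predictions)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true class k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n)) - tp    # predicted k, true class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Illustrative counts only (NOT the paper's results).
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(round(accuracy(toy_cm), 3), round(macro_f1(toy_cm), 3))
```

Macro F1 weights every emotion class equally, so it stays close to accuracy only when no class is systematically misclassified; that is why the two figures are usually reported together.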

What carries the argument

Hybrid CNN-Transformer stack that processes Mel-spectrogram inputs through convolutional feature extraction followed by Transformer attention layers for temporal modeling.
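One way to picture the hand-off between the two stages is to trace tensor shapes: the convolutional blocks shrink the Mel-spectrogram, and each remaining time step can then be flattened into one token for the Transformer. The sketch below is shape bookkeeping only. The layer settings (kernel sizes, channel counts, pooling) are assumptions chosen for illustration, since the abstract does not state the authors' configuration, and `cnn_to_tokens` is a hypothetical helper.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Standard output-length formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def cnn_to_tokens(n_mels, n_frames, blocks):
    """Trace a (1, n_mels, n_frames) spectrogram through conv + max-pool blocks.

    Each block is (out_channels, conv_kernel, conv_padding, pool_size).
    Returns (sequence_length, token_dim) as a Transformer encoder would see
    them if every remaining time step becomes one token.
    """
    channels, freq, time = 1, n_mels, n_frames
    for out_channels, kernel, padding, pool in blocks:
        # Stride-1 convolution applied along both frequency and time axes.
        freq = conv_out(freq, kernel, 1, padding)
        time = conv_out(time, kernel, 1, padding)
        # Non-overlapping pooling (stride equal to the pool size).
        freq = conv_out(freq, pool, pool)
        time = conv_out(time, pool, pool)
        channels = out_channels
    return time, channels * freq


# Assumed configuration: 64 Mel bands, 128 frames, two 3x3 conv blocks with 2x2 pooling.
hypothetical_blocks = [(32, 3, 1, 2), (64, 3, 1, 2)]
print(cnn_to_tokens(64, 128, hypothetical_blocks))
```

With these assumed settings the spectrogram becomes 32 tokens of dimension 1024. A real model would typically project that down to the encoder width before the attention layers.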

If this is right

The hybrid design improves emotion classification accuracy over purely convolutional baselines on Arabic data.
Transformer attention layers successfully handle temporal structure in speech even when training data are limited.
The approach supports real-time emotion-aware interfaces for Arabic-language applications.
Results indicate that attention mechanisms remain useful for dialectal speech variations within Arabic.
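The temporal-modeling points above rest on self-attention letting every time step weigh every other step directly, regardless of distance. A minimal sketch of scaled dot-product self-attention follows. It is a simplification: the query, key, and value projections are taken to be the identity, there is a single head, and the input frames are toy values, none of which reflects the paper's actual configuration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    frames: list of equal-length feature vectors, one per time step.
    Returns (outputs, weights); weights[i][j] is how strongly step i
    attends to step j.
    """
    d = len(frames[0])
    weights = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, frames)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


# Toy sequence: the first and last steps carry the same pattern.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
_, attn = self_attention(toy_frames)
print([round(w, 3) for w in attn[0]])
```

In the toy sequence, step 0 attends to the distant step 3 exactly as strongly as to itself and more strongly than to its immediate neighbour, which is the behaviour "long-range temporal dependencies" refers to.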

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same CNN-Transformer pattern could be tested on other low-resource languages once modest emotion corpora become available.
Practical deployment would need separate checks for robustness across recording environments and speaker demographics not covered in EYASE.
Efficiency measurements on mobile hardware would clarify whether the model fits latency constraints in live voice interfaces.
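The latency check suggested in the last point could start from a harness along these lines. It is a sketch under stated assumptions: `measure_latency_ms` is a hypothetical helper, `predict` stands for whatever callable wraps the trained model, and the stand-in workload below is not a speech model.

```python
import statistics
import time


def measure_latency_ms(predict, example, warmup=3, runs=20):
    """Time repeated calls to predict(example); returns summary stats in ms."""
    for _ in range(warmup):
        predict(example)  # warm caches before timing
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "median": statistics.median(timings),
        # Approximate 95th percentile (rank rounded down).
        "p95": timings[int(0.95 * (len(timings) - 1))],
        "max": timings[-1],
    }


# Stand-in workload; replace with a call into the actual model.
stats = measure_latency_ms(lambda x: sum(v * v for v in x), list(range(10_000)))
print({name: round(value, 3) for name, value in stats.items()})
```

Reporting a tail figure alongside the median matters for live voice interfaces, where occasional slow responses are more noticeable than the average.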

Load-bearing premise

The EYASE corpus supplies a representative and unbiased sample of Arabic emotional speech that is large enough to train the hybrid model without overfitting to its particular characteristics.

What would settle it

Running the identical trained model on an independent Arabic speech emotion dataset collected under different conditions and obtaining markedly lower accuracy would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2604.07357 by Oussama Mustapha Benouddane, Samiya Silarbi, Youcef Soufiane Gheffari.

**Figure 2.** An example of a Mel-spectrogram extracted from an Arabic emotional … (caption truncated in source)

**Figure 3.** Training and validation loss/accuracy curves of the CNN–Transformer … (caption truncated in source)

**Figure 4.** Confusion matrix of the CNN–Transformer on the test set.

Original abstract

Recognizing emotions from speech using machine learning has become an active research area due to its importance in building human-centered applications. However, while many studies have been conducted in English, German, and other European and Asian languages, research in Arabic remains scarce because of the limited availability of annotated datasets. In this paper, we present an Arabic Speech Emotion Recognition (SER) system based on a hybrid CNN-Transformer architecture. The model leverages convolutional layers to extract discriminative spectral features from Mel-spectrogram inputs and Transformer encoders to capture long-range temporal dependencies in speech. Experiments were conducted on the EYASE (Egyptian Arabic speech emotion) corpus, and the proposed model achieved 97.8% accuracy and a macro F1-score of 0.98. These results demonstrate the effectiveness of combining convolutional feature extraction with attention-based modeling for Arabic SER and highlight the potential of Transformer-based approaches in low-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies a standard CNN-Transformer hybrid to Arabic SER and reports 97.8% accuracy on EYASE, but the missing split and baseline details make the result hard to interpret.

This paper takes a hybrid CNN-Transformer that has already been tried in other speech tasks and runs it on the EYASE Egyptian Arabic corpus. The model uses CNN layers on Mel-spectrograms to pull out local features and then Transformer encoders to handle longer temporal patterns, which is a reasonable way to combine the two. They report 97.8% accuracy and 0.98 macro F1, which is the headline result worth noting for anyone tracking low-resource SER work. Arabic datasets are genuinely scarce, so showing that this architecture can produce strong numbers on one of them is a small but practical step forward. The abstract is clear about the motivation and the architecture choice makes sense on paper.

The soft spot is the evaluation. There is no description of the train/test split, whether it was speaker-independent, how many speakers or utterances are involved, any cross-validation, or even simple baselines like a plain CNN or LSTM. On a corpus that is likely modest in size, that absence leaves the high accuracy open to the usual concerns about overfitting or leakage. The stress-test note is right to flag the unverified protocol. Without those controls the number is difficult to treat as evidence of generalization.

This is mainly for readers already working on speech emotion recognition in Arabic or similar languages who want to see one more data point on hybrid models. It does not introduce new architecture ideas or theory, so it will not move the broader field. A serious editor should send it to peer review only if the authors add the missing experimental details, baselines, and split statistics. The core idea is straightforward enough that a revised version with proper validation could be worth referee time, but the current version is too thin to stand alone.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a hybrid CNN-Transformer architecture for Arabic speech emotion recognition. Convolutional layers extract discriminative spectral features from Mel-spectrogram inputs while Transformer encoders capture long-range temporal dependencies. Experiments on the EYASE corpus report 97.8% accuracy and a macro F1-score of 0.98.

Significance. If the evaluation protocol is sound and results are reproducible with proper controls, the work would advance SER research for low-resource languages by illustrating the utility of combining local convolutional feature extraction with attention-based temporal modeling.

major comments (2)

[Abstract] Abstract: The central performance claim (97.8% accuracy, macro F1 0.98) is presented without any information on data splits, cross-validation, speaker independence, dataset size/speaker count, or baseline comparisons. This renders the result uninterpretable as evidence of generalization rather than potential overfitting or leakage on the modest EYASE corpus.
[Experiments] Experiments section: No description of the train/test protocol, error bars, or comparisons to prior Arabic SER methods is supplied, which is load-bearing for the claim that the hybrid model is effective.
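The protocol the major comments ask for can be sketched in a few lines: partition by speaker so that no voice appears in both training and test data, then report the mean and standard deviation across folds as error bars. Everything below is illustrative. The speaker labels and per-fold accuracies are placeholders rather than EYASE metadata or results from the paper, and both function names are hypothetical.

```python
import random
import statistics


def speaker_independent_folds(speaker_ids, n_folds, seed=0):
    """Build folds so that no speaker appears in both train and test.

    speaker_ids: one speaker label per utterance.
    Returns a list of (train_indices, test_indices) pairs.
    """
    speakers = sorted(set(speaker_ids))
    if n_folds > len(speakers):
        raise ValueError("need at least one speaker per fold")
    random.Random(seed).shuffle(speakers)
    fold_of = {spk: i % n_folds for i, spk in enumerate(speakers)}
    folds = []
    for f in range(n_folds):
        test = [i for i, s in enumerate(speaker_ids) if fold_of[s] == f]
        train = [i for i, s in enumerate(speaker_ids) if fold_of[s] != f]
        folds.append((train, test))
    return folds


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Hypothetical speaker labels, one per utterance (not EYASE metadata).
speaker_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
for train_idx, test_idx in speaker_independent_folds(speaker_ids, n_folds=2):
    train_speakers = {speaker_ids[i] for i in train_idx}
    test_speakers = {speaker_ids[i] for i in test_idx}
    assert not (train_speakers & test_speakers)  # no speaker leakage

# Placeholder per-fold accuracies, made up for illustration only.
mean_acc, std_acc = summarize([0.90, 0.80, 1.00])
print(f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

A random utterance-level split would let the model see each test speaker during training, which is one of the leakage routes the comments have in mind.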

minor comments (1)

[Abstract] Abstract: Consider adding one sentence on key hyperparameters or input preprocessing to improve clarity for readers unfamiliar with the EYASE corpus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our experimental reporting. We will revise the manuscript to address these points by adding the required details on dataset characteristics, evaluation protocols, and comparisons, thereby improving the interpretability and reproducibility of our results.

Referee: [Abstract] Abstract: The central performance claim (97.8% accuracy, macro F1 0.98) is presented without any information on data splits, cross-validation, speaker independence, dataset size/speaker count, or baseline comparisons. This renders the result uninterpretable as evidence of generalization rather than potential overfitting or leakage on the modest EYASE corpus.

Authors: We agree that the abstract would benefit from additional context to allow proper assessment of the results. In the revised manuscript, we will expand the abstract to include the EYASE corpus details (speaker count and total utterances), the speaker-independent partitioning strategy, the cross-validation approach employed, and a brief reference to baseline comparisons. This will clarify that the reported metrics reflect generalization rather than overfitting or leakage.

Revision: yes

Referee: [Experiments] Experiments section: No description of the train/test protocol, error bars, or comparisons to prior Arabic SER methods is supplied, which is load-bearing for the claim that the hybrid model is effective.

Authors: We acknowledge the absence of these details in the original Experiments section. We will revise the section to fully describe the train/test protocol (including speaker-independent splits and cross-validation folds), report results with error bars (standard deviation across folds), and add quantitative comparisons against prior Arabic SER methods from the literature. These additions will substantiate the hybrid architecture's effectiveness.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical performance claim

The paper presents a hybrid CNN-Transformer model for Arabic SER and reports an empirical result of 97.8% accuracy and 0.98 macro F1 on the EYASE corpus. The provided text contains no equations, derivations, fitted parameters renamed as predictions, or self-citations that reduce any claim to its own inputs by construction. The result is a direct experimental outcome rather than an analytical chain, so no circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim rests on empirical performance of the hybrid architecture on the EYASE dataset; no free parameters, axioms, or invented entities are stated or required in the abstract.

pith-pipeline@v0.9.0 · 5466 in / 959 out tokens · 26267 ms · 2026-05-14T22:22:06.297051+00:00 · methodology
