arxiv: 2509.08470 · v2 · submitted 2025-09-10 · 📡 eess.AS · cs.AI

Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition

Pith reviewed 2026-05-18 18:03 UTC · model grok-4.3

classification 📡 eess.AS cs.AI
keywords speech emotion recognition · speech enhancement · multi-task learning · mixture of experts · noisy speech · self-supervised representations

The pith

Frame-wise expert routing on self-supervised features lets one model improve both speech enhancement and emotion recognition in noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Sparse MERIT, a multi-task setup that routes each audio frame through a shared pool of experts using task-specific gates. This dynamic selection aims to avoid the usual clashes between cleaning noisy speech and extracting emotional content that plague standard shared models or separate enhancement steps. By working directly on self-supervised representations, the approach keeps parameter counts low while adapting the learned features to both goals at once. Experiments on the MSP-Podcast corpus with added noise show gains on both tasks, most clearly when the signal-to-noise ratio drops to minus five decibels. Readers interested in real-world voice systems would care because the method removes the need to choose between preprocessing artifacts and task interference.
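
A noisy condition such as the minus five decibel case is commonly produced by scaling a noise recording before adding it to clean speech. The sketch below shows that scaling under stated assumptions: both signals are equal-length lists of floats, and the names `mix_at_snr` and `measured_snr_db` are hypothetical helpers, not the paper's data pipeline.

```python
import math

def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, target_db):
    """Scale `noise` so the speech-to-noise power ratio equals target_db, then add it."""
    # target_db = 10*log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noisy):
    """SNR of a mixture in dB, given the clean reference."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))

# Toy signals (hypothetical): a sine as "speech", an alternating sequence as "noise".
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [0.3 if t % 2 else -0.3 for t in range(1000)]
noisy = mix_at_snr(speech, noise, target_db=-5.0)
```

At a negative SNR the scaled noise carries more power than the speech, which is why this condition is the hardest one reported.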

Core claim

Sparse MERIT uses task-specific gating networks to perform frame-wise dynamic selection from a shared expert pool applied to self-supervised speech representations, enabling joint optimization of speech enhancement and speech emotion recognition that reduces gradient interference and representational conflicts.

What carries the argument

Sparse MERIT: mixture-of-experts architecture with frame-wise routing through task-specific gating networks over a shared expert pool for adaptive representation learning.
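
The routing idea can be illustrated with a small sketch. It assumes top-K gating, linear experts, and linear gates, none of which are confirmed by the text above; all names (`route_frame`, `experts`, `gates`) and sizes are hypothetical, and training details such as load balancing are left out.

```python
import math
import random

random.seed(0)
D, E, K = 4, 3, 2  # feature dim, experts in the shared pool, experts kept per frame

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Shared expert pool: one D x D linear map per expert (a stand-in for small networks).
experts = [rand_matrix(D, D) for _ in range(E)]
# One gate per task: maps a frame to E routing logits.
gates = {"se": rand_matrix(E, D), "ser": rand_matrix(E, D)}

def route_frame(frame, task):
    """Return (task-adapted frame, chosen expert ids) using top-K sparse gating."""
    logits = matvec(gates[task], frame)
    top = sorted(range(E), key=lambda e: logits[e], reverse=True)[:K]
    # Softmax over the selected experts only; unselected experts contribute nothing.
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    weights = [x / sum(exps) for x in exps]
    out = [0.0] * D
    for w, e in zip(weights, top):
        y = matvec(experts[e], frame)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

# Stand-in self-supervised features: 5 frames of dimension D.
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
se_out = [route_frame(f, "se") for f in frames]
ser_out = [route_frame(f, "ser") for f in frames]
```

Because each task owns its gate, the same frame may be sent to different experts for enhancement and for emotion recognition, while the expert weights themselves stay shared.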

If this is right

  • At -5 dB SNR the method raises SER F1-macro by an average of 12.0 percent over a baseline that uses enhancement as a pre-processing step and by 3.4 percent over naive multi-task learning, with statistically significant gains on unseen noise conditions.
  • Segmental SNR for the enhancement task rises 28.2 percent over the pre-processing baseline and 20.0 percent over the naive multi-task baseline.
  • Both tasks improve at the same time rather than trading off performance.
  • The routing remains effective when the test noises differ from those seen during training.
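
For readers unfamiliar with the enhancement metric, segmental SNR averages a per-frame SNR over the utterance. The sketch below follows a common convention (non-overlapping frames, per-frame values clamped to a fixed range); the frame length and the clamp limits are assumptions, not values taken from the paper.

```python
import math

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB, with each frame's value clamped to [lo, hi]."""
    eps = 1e-10  # avoids log(0) and division by zero on silent or perfect frames
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal = sum(v * v for v in c)
        error = sum((v - w) ** 2 for v, w in zip(c, e))
        snr = 10 * math.log10((signal + eps) / (error + eps))
        values.append(min(max(snr, lo), hi))
    return sum(values) / len(values)

# Toy signals (hypothetical): a sine reference, a perfect copy, and an offset copy.
clean = [math.sin(0.05 * t) for t in range(1024)]
perfect = list(clean)
degraded = [v + 0.5 for v in clean]
```

A perfect reconstruction saturates at the upper clamp, so the score is bounded and a few very clean frames cannot dominate the average.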

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same routing idea could be tested on other paired audio tasks such as dereverberation paired with speaker verification.
  • Increasing the number of experts while keeping the gating sparse might further reduce conflicts if extra compute is available.
  • Replacing the current self-supervised front-end with a different pretrained encoder would test whether the routing benefit depends on the specific input features.

Load-bearing premise

Dynamic frame-level routing through a shared expert pool can separate the needs of enhancement and emotion recognition without dropping task-critical details from the input features.

What would settle it

If a non-routed shared-backbone model trained on the same data and self-supervised features matches or exceeds Sparse MERIT on both F1-macro and segmental SNR at -5 dB SNR on unseen noise, the benefit of the mixture routing would be called into question.
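
The test described above reduces to computing both metrics for the two models and checking whether the dense model is at least as good on each. A minimal sketch of the macro-F1 side and the decision rule follows; the function names are hypothetical, and the labels are toy placeholders rather than results.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def routing_benefit_in_doubt(dense, merit):
    """True if the dense baseline matches or beats the routed model on both metrics."""
    return dense["f1_macro"] >= merit["f1_macro"] and dense["ssnr"] >= merit["ssnr"]

# Toy labels (placeholders, not data from the paper).
y_true = ["happy", "sad", "sad", "angry"]
y_pred = ["happy", "sad", "angry", "angry"]
score = f1_macro(y_true, y_pred)
```

Macro averaging weights every emotion class equally, so a model cannot score well by favouring the most frequent class.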

Figures

Figures reproduced from arXiv: 2509.08470 by Carlos Busso, Chi-Chun Lee, Jing-Tong Tzeng.

Figure 1: The proposed Sparse MERIT framework for enhanced speech emotion recognition, leveraging unified self-supervised speech representations through ...
Figure 2: Expert usage distributions across SNR conditions.
Figure 3: Frame-level expert usage distributions across emotion classes.
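
Usage distributions of the kind named in Figures 2 and 3 can be derived from logged routing decisions. The sketch below is one way to do it, assuming each frame records the expert ids it was routed to plus a group label (an SNR level or an emotion class); `expert_usage` and the toy inputs are hypothetical.

```python
from collections import Counter, defaultdict

def expert_usage(routes, groups, num_experts):
    """Per-group fraction of routing slots assigned to each expert.

    routes: one list of chosen expert ids per frame.
    groups: one label per frame (for example an SNR level or an emotion class).
    """
    counts = defaultdict(Counter)
    for chosen, g in zip(routes, groups):
        counts[g].update(chosen)
    usage = {}
    for g, c in counts.items():
        total = sum(c.values())
        usage[g] = [c[e] / total for e in range(num_experts)]
    return usage

# Toy routing log (placeholder values): 4 frames, 2 experts kept per frame.
routes = [[0, 1], [0, 2], [1, 2], [0, 1]]
groups = ["-5dB", "-5dB", "10dB", "10dB"]
usage = expert_usage(routes, groups, num_experts=3)
```

Distributions that differ clearly between groups would suggest the gates specialise by condition; near-uniform ones would suggest they do not.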

Original abstract

Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task-adaptive representation learning. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks. Under the most challenging condition of -5 dB signal-to-noise ratio (SNR), Sparse MERIT improves SER F1-macro by an average of 12.0% over a baseline relying on a SE pre-processing strategy, and by 3.4% over a naive MTL baseline, with statistical significance on unseen noise conditions. For SE, Sparse MERIT improves segmental SNR (SSNR) by 28.2% over the SE pre-processing baseline and by 20.0% over the naive MTL baseline. These results demonstrate that Sparse MERIT provides robust and generalizable performance for both emotion recognition and enhancement tasks in noisy environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Sparse MERIT, a multi-task learning framework that uses a sparse mixture-of-experts architecture with frame-wise dynamic routing over self-supervised speech representations to jointly perform speech enhancement (SE) and speech emotion recognition (SER). Task-specific gating networks select experts from a shared pool to mitigate gradient interference and representational conflicts between the tasks. Experiments on the MSP-Podcast corpus under noisy conditions, including unseen noise at low SNRs, report consistent gains over SE pre-processing and naive MTL baselines, with specific improvements at -5 dB SNR of 12.0% in SER F1-macro and 28.2% in segmental SNR (SSNR).

Significance. If the reported gains are reproducible, the work offers a practical advance for robust emotion recognition in noisy environments by avoiding artifacts from separate enhancement stages and reducing task conflicts in joint training. The parameter-efficient MoE routing on self-supervised features is a timely contribution to multi-task speech processing, and the evaluation on public data with statistical significance claims on held-out conditions strengthens the case for generalizability.

major comments (2)
  1. [Abstract and §4] The performance claims (e.g., 12.0% SER F1-macro gain and 28.2% SSNR gain at -5 dB SNR) are load-bearing for the central contribution, yet the manuscript provides no details on exact baseline implementations, hyperparameter search procedures, or training protocols for the SE pre-processing and naive MTL baselines; this limits independent verification of the improvements.
  2. [§3.2] The frame-wise routing mechanism is described as resolving representational conflicts without losing task-critical information, but the paper does not include an ablation isolating the effect of sparsity or dynamic selection versus a dense shared backbone; without this, it is unclear whether the gains stem from the proposed routing or from other factors such as increased capacity.
minor comments (2)
  1. [§3] Notation for the gating network and expert selection should be clarified with explicit equations to distinguish the task-specific gates from the shared expert pool.
  2. [§5] Figure captions and axis labels in the results section could be expanded to indicate the exact noise types and SNR levels used in each panel for easier interpretation.
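
To make the first minor comment concrete, the kind of explicit notation being requested might look like the following. This is a generic sparse multi-gate formulation written under assumed symbols and an assumed top-k rule; it is not the paper's own set of equations.

```latex
% Assumed notation, not taken from the paper:
%   h_t          self-supervised feature of frame t
%   E_i          i-th expert in the shared pool
%   W^{(\tau)}   gate weights owned by task \tau
%   k            number of experts kept per frame
\begin{align}
  g^{(\tau)}_t &= W^{(\tau)} h_t, \qquad \tau \in \{\mathrm{SE}, \mathrm{SER}\} \\
  \mathcal{S}^{(\tau)}_t &= \operatorname{TopK}\bigl(g^{(\tau)}_t,\, k\bigr) \\
  \alpha^{(\tau)}_{t,i} &= \frac{\exp g^{(\tau)}_{t,i}}
      {\sum_{j \in \mathcal{S}^{(\tau)}_t} \exp g^{(\tau)}_{t,j}},
      \qquad i \in \mathcal{S}^{(\tau)}_t \\
  z^{(\tau)}_t &= \sum_{i \in \mathcal{S}^{(\tau)}_t} \alpha^{(\tau)}_{t,i}\, E_i(h_t)
\end{align}
```

Written this way, the task index appears only on the gate quantities and on the output, while the experts carry no task index, which is the distinction the comment asks the authors to spell out.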

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4] The performance claims (e.g., 12.0% SER F1-macro gain and 28.2% SSNR gain at -5 dB SNR) are load-bearing for the central contribution, yet the manuscript provides no details on exact baseline implementations, hyperparameter search procedures, or training protocols for the SE pre-processing and naive MTL baselines; this limits independent verification of the improvements.

    Authors: We agree that greater detail on the baselines is required for reproducibility. In the revised manuscript we will expand §4 to specify the exact architectures and training configurations of the SE pre-processing models, the naive MTL baseline, the hyperparameter search procedure (including ranges and selection method), and the full training protocols (optimizer, learning-rate schedule, batch size, and epochs). These additions will appear in the main text or supplementary material as space allows.
    Revision: yes

  2. Referee: [§3.2] The frame-wise routing mechanism is described as resolving representational conflicts without losing task-critical information, but the paper does not include an ablation isolating the effect of sparsity or dynamic selection versus a dense shared backbone; without this, it is unclear whether the gains stem from the proposed routing or from other factors such as increased capacity.

    Authors: The referee correctly notes the absence of a direct ablation against a capacity-matched dense backbone. We will add this comparison in the revised version, reporting results for a dense shared-backbone model with parameter count comparable to Sparse MERIT. The new ablation will quantify the contribution of frame-wise sparsity and task-specific gating to the observed gains and will be discussed in §4.
    Revision: yes
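
Sizing a dense baseline with a comparable parameter count is simple arithmetic once the layer shapes are fixed. The sketch below assumes two-layer feed-forward experts and linear gates; the shapes, the sizes, and the function names are hypothetical and do not describe the paper's configuration.

```python
def moe_params(d, hidden, num_experts, num_tasks):
    """Parameters of num_experts two-layer experts (d -> hidden -> d) plus one linear gate per task."""
    expert = (d * hidden + hidden) + (hidden * d + d)
    gate = d * num_experts + num_experts
    return num_experts * expert + num_tasks * gate

def dense_params(d, hidden):
    """Parameters of one two-layer feed-forward block (d -> hidden -> d) with biases."""
    return (d * hidden + hidden) + (hidden * d + d)

def matched_dense_hidden(d, target):
    """Largest hidden width whose dense block does not exceed `target` parameters."""
    # dense_params = hidden * (2d + 1) + d, solved for hidden and rounded down.
    return (target - d) // (2 * d + 1)

# Illustrative sizes only (hypothetical, not the paper's configuration).
d, hidden, num_experts, num_tasks = 768, 256, 4, 2
target = moe_params(d, hidden, num_experts, num_tasks)
h_dense = matched_dense_hidden(d, target)
```

This matches total parameters. A sparse model only activates some experts per frame, so an ablation could also match active parameters per frame, and the two choices give different dense widths.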

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper introduces Sparse MERIT as an empirical MTL architecture with frame-wise MoE routing over self-supervised features for joint SE and SER. All reported results consist of measured performance deltas (F1-macro, SSNR) on the external MSP-Podcast corpus under controlled noisy conditions, including unseen noise, with explicit baseline comparisons and statistical significance. No equations, uniqueness theorems, or first-principles derivations appear in the provided text; the method is presented as a proposed framework whose value is established by external validation rather than by reducing any quantity to a fitted parameter or self-citation by construction. The reader's assessment of score 2.0 is consistent with this self-contained empirical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract provides insufficient detail to enumerate specific free parameters or invented entities beyond the high-level description of the proposed technique.

axioms (1)
  • [domain assumption] Self-supervised speech representations contain sufficient information for both enhancement and emotion recognition tasks.
    The method applies expert routing directly over these representations as the shared input.
invented entities (1)
  • Sparse MERIT framework (no independent evidence)
    Purpose: joint multi-task learning architecture using sparse MoE.
    Newly proposed technique combining task-specific gating with shared experts.

pith-pipeline@v0.9.0 · 5822 in / 1392 out tokens · 53077 ms · 2026-05-18T18:03:27.629828+00:00 · methodology


Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 6 internal anchors

  1. [1]

    Adaptive virtual assistant interaction through real-time speech emotion analysis using hybrid deep learning models and contextual awareness,

    E. K. Zadeh and M. Alaeifard, “Adaptive virtual assistant interaction through real-time speech emotion analysis using hybrid deep learning models and contextual awareness,”International Journal of Advanced Human Computer Interaction, vol. 1, no. 1, pp. 1–15, 2023

  2. [2]

    Real-time speech emotion analysis for smart home assistants,

    R. Chatterjee, S. Mazumdar, R. S. Sherratt, R. Halder, T. Maitra, and D. Giri, “Real-time speech emotion analysis for smart home assistants,” IEEE Transactions on Consumer Electronics, vol. 67, no. 1, pp. 68–76, 2021

  3. [3]

    Enabling intelligent environment by the design of emotionally aware virtual assistant: A case of smart campus,

    P.-S. Chiu, J.-W. Chang, M.-C. Lee, C.-H. Chen, and D.-S. Lee, “Enabling intelligent environment by the design of emotionally aware virtual assistant: A case of smart campus,”IEEE Access, vol. 8, pp. 62 032–62 041, 2020

  4. [4]

    Multilayer neural network based speech emotion recognition for smart assistance

    S. Kumar, M. A. Haq, A. Jain, C. A. Jason, N. R. Moparthi, N. Mittal, and Z. S. Alzamil, “Multilayer neural network based speech emotion recognition for smart assistance.”Computers, Materials & Continua, vol. 75, no. 1, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 11

  5. [5]

    Speech emotion recognition using supervised deep recurrent system for mental health monitoring,

    N. Elsayed, Z. ElSayed, N. Asadizanjani, M. Ozer, A. Abdelgawad, and M. Bayoumi, “Speech emotion recognition using supervised deep recurrent system for mental health monitoring,” in2022 IEEE 8th World Forum on Internet of Things (WF-IoT), 2022, pp. 1–6

  6. [6]

    Automatic speech emotion recognition using machine learning: digital transformation of mental health,

    S. Madanian, D. Parry, O. Adeleye, C. Poellabauer, F. Mirza, S. Mathew, and S. Schneider, “Automatic speech emotion recognition using machine learning: digital transformation of mental health,” inProceedings of the Annual Pacific Asia Conference on Information Systems (PACIS), 2022

  7. [7]

    Depression severity classification from speech emotion,

    S. Harati, A. Crowell, H. Mayberg, and S. Nemati, “Depression severity classification from speech emotion,” in2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2018, pp. 5763–5766

  8. [8]

    Speech emotion recognition for power customer service,

    X. Li and R. Lin, “Speech emotion recognition for power customer service,” in2021 7th International Conference on Computer and Com- munications (ICCC), 2021, pp. 514–518

  9. [9]

    Ordinal learning for emotion recognition in customer service calls,

    W. Han, T. Jiang, Y . Li, B. Schuller, and H. Ruan, “Ordinal learning for emotion recognition in customer service calls,” inICASSP 2020- 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 6494–6498

  10. [10]

    End-to-end continuous speech emotion recog- nition in real-life customer service call center conversations,

    Y . Feng and L. Devillers, “End-to-end continuous speech emotion recog- nition in real-life customer service call center conversations,” in2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 2023, pp. 1–8

  11. [11]

    Front-end feature compensation and denoising for noise robust speech emotion recognition,

    R. Chakraborty, A. Panda, M. Pandharipande, S. Joshi, and S. K. Kopparapu, “Front-end feature compensation and denoising for noise robust speech emotion recognition,” inInterspeech 2019, 2019, pp. 3257–3261

  12. [12]

    Emotion recognition in the noise applying large acoustic feature sets,

    B. Schuller, D. Arsic, F. Wallhoff, and G. Rigoll, “Emotion recognition in the noise applying large acoustic feature sets,” inSpeech Prosody 2006, 2006, p. paper 128

  13. [13]

    Not all features are equal: Selection of robust features for speech emotion recognition in noisy environments,

    S.-G. Leem, D. Fulford, J.-P. Onnela, D. Gard, and C. Busso, “Not all features are equal: Selection of robust features for speech emotion recognition in noisy environments,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), Singapore, May 2022, pp. 6447–6451

  14. [14]

    Metricaug: A distortion metric-lead augmen- tation strategy for training noise-robust speech emotion recognizer,

    Y .-T. Wu and C.-C. Lee, “Metricaug: A distortion metric-lead augmen- tation strategy for training noise-robust speech emotion recognizer,” in Proc. INTERSPEECH, vol. 2023, 2023, pp. 3587–3591

  15. [15]

    Best practices for noise-based augmen- tation to improve the performance of deployable speech-based emotion recognition systems,

    M. Jaiswal and E. M. Provost, “Best practices for noise-based augmen- tation to improve the performance of deployable speech-based emotion recognition systems,”arXiv preprint arXiv:2104.08806, 2021

  16. [16]

    Reinforcement learning based data augmentation for noise robust speech emotion recognition,

    S. Ranjan, R. Chakraborty, and S. K. Kopparapu, “Reinforcement learning based data augmentation for noise robust speech emotion recognition,” inProc. Interspeech 2024, 2024, pp. 1040–1044

  17. [17]

    Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions,

    U. Tiwari, M. Soni, R. Chakraborty, A. Panda, and S. K. Kopparapu, “Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions,” inICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7194–7198

  18. [18]

    Com- putation and memory efficient noise adaptation of Wav2Vec2.0 for noisy speech emotion recognition with skip connection adapters,

    S.-G. Leem, D. Fulford, J.-P. Onnela, D. Gard, and C. Busso, “Com- putation and memory efficient noise adaptation of Wav2Vec2.0 for noisy speech emotion recognition with skip connection adapters,” in Interspeech 2023, Dublin, Ireland, August 2023, pp. 1888–1892

  19. [19]

    Describe where you are: Improving noise-robustness for speech emotion recognition with text description of the environment,

    ——, “Describe where you are: Improving noise-robustness for speech emotion recognition with text description of the environment,”ArXiv e-prints (arXiv:2407.17716), pp. 1–14, July 2024

  20. [20]

    Towards noise robust speech emotion recog- nition using dynamic layer customization,

    A. Wilf and E. M. Provost, “Towards noise robust speech emotion recog- nition using dynamic layer customization,” in2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), 2021, pp. 1–8

  21. [21]

    Separation of emotional and reconstruction embeddings on ladder network to improve speech emotion recognition robustness in noisy conditions,

    S.-G. Leem, D. Fulford, J.-P. Onnela, D. Gard, and C. Busso, “Separation of emotional and reconstruction embeddings on ladder network to improve speech emotion recognition robustness in noisy conditions,” inInterspeech 2021, Brno, Czech Republic, August-September 2021, pp. 2871–2875

  22. [22]

    Adapting a self-supervised speech representation for noisy speech emotion recognition by using contrastive teacher-student learning,

    ——, “Adapting a self-supervised speech representation for noisy speech emotion recognition by using contrastive teacher-student learning,” in IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP 2023), Rhodes island, Greece, June 2023, pp. 1–5

  23. [23]

    End-to-end speech emotion recognition: Challenges of real-life emergency call centers data recordings,

    T. Deschamps-Berger, L. Lamel, and L. Devillers, “End-to-end speech emotion recognition: Challenges of real-life emergency call centers data recordings,” in2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), 2021, pp. 1–8

  24. [24]

    Enhancing emergency response through speech emotion recognition: A machine learning approach,

    P. Deb, H. Mahrin, and A. R. Bhuiyan, “Enhancing emergency response through speech emotion recognition: A machine learning approach,” in2023 26th International Conference on Computer and Information Technology (ICCIT), 2023, pp. 1–5

  25. [25]

    Investigating trans- former encoders and fusion strategies for speech emotion recognition in emergency call center conversations

    T. Deschamps-Berger, L. Lamel, and L. Devillers, “Investigating trans- former encoders and fusion strategies for speech emotion recognition in emergency call center conversations.” inCompanion Publication of the 2022 International Conference on Multimodal Interaction, 2022, pp. 144–153

  26. [26]

    Towards robust speech emotion recognition using deep resid- ual networks for speech enhancement,

    A. Triantafyllopoulos, G. Keren, J. Wagner, I. Steiner, and B. W. Schuller, “Towards robust speech emotion recognition using deep resid- ual networks for speech enhancement,” inInterspeech 2019, 2019, pp. 1691–1695

  27. [27]

    Task-specific speech enhancement and data augmentation for improved multimodal emotion recognition under noisy conditions,

    S. Kshirsagar, A. Pendyala, and T. H. Falk, “Task-specific speech enhancement and data augmentation for improved multimodal emotion recognition under noisy conditions,”Frontiers in Computer Science, vol. 5, p. 1039261, 2023

  28. [28]

    Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement,

    Y .-W. Chen, J. Hirschberg, and Y . Tsao, “Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement,” arXiv preprint arXiv:2309.01164, 2023

  29. [29]

    Investigating speech enhancement and perceptual quality for speech emotion recog- nition,

    A. R. Avila, M. J. Alam, D. O’Shaughnessy, and T. Falk, “Investigating speech enhancement and perceptual quality for speech emotion recog- nition,” inInterspeech 2018, 2018, pp. 3663–3667

  30. [30]

    Noise-robust speech emotion recognition using shared self-supervised representations with integrated speech enhancement,

    J.-T. Tzeng, S.-G. Leem, A. N. Salman, C.-C. Lee, and C. Busso, “Noise-robust speech emotion recognition using shared self-supervised representations with integrated speech enhancement,” inICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  31. [31]

    Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,

    Z. Chen, V . Badrinarayanan, C.-Y . Lee, and A. Rabinovich, “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” inInternational conference on machine learning. PMLR, 2018, pp. 794–803

  32. [32]

    Gra- dient surgery for multi-task learning,

    T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gra- dient surgery for multi-task learning,”Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020

  33. [33]

    Modeling task relationships in multi-task learning with multi-gate mixture-of-experts,

    J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts,” inProceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 1930–1939

  34. [34]

    Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech From Existing Podcast Recordings,

    R. Lotfian and C. Busso, “Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech From Existing Podcast Recordings,”IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, October-December 2019

  35. [35]

    Odyssey 2024 - speech emotion recognition challenge: Dataset, baseline framework, and results,

    L. Goncalves, A. Salman, A. Reddy Naini, L. Moro-Velazquez, T. The- baud, P. Garcia, N. Dehak, B. Sisman, and C. Busso, “Odyssey 2024 - speech emotion recognition challenge: Dataset, baseline framework, and results,” inThe Speaker and Language Recognition Workshop (Odyssey 2024), Quebec, Canada, June 2024, pp. 247–254

  36. [36]

    Selective acoustic feature enhancement for speech emotion recognition with noisy speech,

    S.-G. Leem, D. Fulford, J.-P. Onnela, D. Gard, and C. Busso, “Selective acoustic feature enhancement for speech emotion recognition with noisy speech,”IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 32, pp. 917–929, 2024

  37. [37]

    Robust front-end processing for emotion recognition in noisy speech,

    M. Pandharipande, R. Chakraborty, A. Panda, and S. K. Kopparapu, “Robust front-end processing for emotion recognition in noisy speech,” in2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018, pp. 324–328

  38. [38]

    An unsupervised frame selection technique for robust emotion recognition in noisy speech,

    ——, “An unsupervised frame selection technique for robust emotion recognition in noisy speech,” in2018 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 2055–2059

  39. [39]

    Keep, delete, or substitute: Frame selection strategy for noise-robust speech emotion recognition,

    S.-G. Leem, D. Fulford, J. Onnela, D. Gard, and C. Busso, “Keep, delete, or substitute: Frame selection strategy for noise-robust speech emotion recognition,” inInterspeech 2024, Kos Island, Greece, September 2024, pp. 3734–3738

  40. [40]

    From neural pca to deep unsupervised learning,

    H. Valpola, “From neural pca to deep unsupervised learning,” in Advances in independent component analysis and learning machines. Elsevier, 2015, pp. 143–171

  41. [41]

    Semi-supervised speech emotion recog- nition with ladder networks,

    S. Parthasarathy and C. Busso, “Semi-supervised speech emotion recog- nition with ladder networks,”IEEE/ACM transactions on audio, speech, and language processing, vol. 28, pp. 2697–2709, 2020

  42. [42]

    Domain separation networks,

    K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan, “Domain separation networks,” inAdvances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc.,

  43. [43]

    Available: https://proceedings.neurips.cc/paper files/ paper/2016/file/45fbc6d3e05ebd93369ce542e8f2322d-Paper.pdf

    [Online]. Available: https://proceedings.neurips.cc/paper files/ paper/2016/file/45fbc6d3e05ebd93369ce542e8f2322d-Paper.pdf

  44. [44]

    Spectral feature mapping with mimic loss for robust speech recognition,

    D. Bagchi, P. Plantinga, A. Stiff, and E. Fosler-Lussier, “Spectral feature mapping with mimic loss for robust speech recognition,” in2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5609–5613

  45. [45]

    Capturing long-term temporal dependencies with convolutional JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12 networks for continuous emotion recognition,

    S. Khorram, Z. Aldeneh, D. Dimitriadis, M. McInnis, and E. M. Provost, “Capturing long-term temporal dependencies with convolutional JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12 networks for continuous emotion recognition,” inInterspeech 2017, 2017, pp. 1253–1257

  46. [46]

    Versa- tile audio-visual learning for emotion recognition,

    L. Goncalves, S.-G. Leem, W.-C. Lin, B. Sisman, and C. Busso, “Versa- tile audio-visual learning for emotion recognition,”IEEE Transactions on Affective Computing, vol. 16, no. 1, pp. 306–318, January-March 2025

  47. [47]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449– 12 460, 2020

  48. [48]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

  49. [49]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  50. [50]

    arXiv preprint arXiv:2105.01051 , year=

    S.-w. Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y . Y . Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Linet al., “Superb: Speech processing universal performance benchmark,”arXiv preprint arXiv:2105.01051, 2021

  51. [51]

    SUPERB-SG: Enhanced speech processing universal PERformance benchmark for semantic and generative capabilities,

    H.-S. Tsai, H.-J. Chang, W.-C. Huang, Z. Huang, K. Lakhotia, S.-w. Yang, S. Dong, A. Liu, C.-I. Lai, J. Shi, X. Chang, P. Hall, H.-J. Chen, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB-SG: Enhanced speech processing universal PERformance benchmark for semantic and generative capabilities,” inProceedings of the 60th Annual Meeting of the Assoc...

  52. [52]

    Boosting self-supervised embeddings for speech enhancement,

    K.-H. Hung, S. wei Fu, H.-H. Tseng, H.-T. Chiang, Y . Tsao, and C.-W. Lin, “Boosting self-supervised embeddings for speech enhancement,” in Interspeech 2022, 2022, pp. 186–190

  53. [53]

    Exploring wavlm on speech enhancement,

    H. Song, S. Chen, Z. Chen, Y . Wu, T. Yoshioka, M. Tang, J. W. Shin, and S. Liu, “Exploring wavlm on speech enhancement,” in2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 451– 457

  54. [54]

    Speech enhancement using self- supervised pre-trained model and vector quantization,

    X.-Y . Zhao, Q.-S. Zhu, and J. Zhang, “Speech enhancement using self- supervised pre-trained model and vector quantization,” in2022 Asia- Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022, pp. 330–334

  55. [55]

    A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding,

    Y. Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2021

  56. [56]

    Evaluating self-supervised speech representations for speech emotion recognition,

    B. T. Atmaja and A. Sasou, “Evaluating self-supervised speech representations for speech emotion recognition,” IEEE Access, vol. 10, pp. 124 396–124 407, 2022

  57. [57]

    The Interspeech 2025 challenge on speech emotion recognition in naturalistic conditions,

    A. Reddy Naini, L. Goncalves, A. Salman, P. Mote, I. Ülgen, T. Thebaud, L. Moro-Velazquez, L. Garcia, N. Dehak, B. Sisman, and C. Busso, “The Interspeech 2025 challenge on speech emotion recognition in naturalistic conditions,” in Interspeech 2025, vol. accepted, Rotterdam, The Netherlands, August 2025

  58. [58]

    Multi-task Sequence to Sequence Learning

    M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, “Multi-task sequence to sequence learning,” arXiv preprint arXiv:1511.06114, 2015

  59. [59]

    One Model To Learn Them All

    L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017

  60. [60]

    Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,

    A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7482–7491

  61. [61]

    Sparsely activated mixture-of-experts are robust multi-task learners,

    S. Gupta, S. Mukherjee, K. Subudhi, E. Gonzalez, D. Jose, A. H. Awadallah, and J. Gao, “Sparsely activated mixture-of-experts are robust multi-task learners,” arXiv preprint arXiv:2204.07689, 2022

  62. [62]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

  63. [63]

    Mixtral of Experts

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

  64. [64]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  65. [65]

    Llama-moe: Building mixture-of-experts from llama with continual pre-training,

    T. Zhu, X. Qu, D. Dong, J. Ruan, J. Tong, C. He, and Y. Cheng, “Llama-moe: Building mixture-of-experts from llama with continual pre-training,” arXiv preprint arXiv:2406.16554, 2024

  66. [66]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu et al., “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,” arXiv preprint arXiv:2401.06066, 2024

  67. [67]

    M³ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design,

    H. Liang, Z. Fan, R. Sarkar, Z. Jiang, T. Chen, K. Zou, Y. Cheng, C. Hao, Z. Wang et al., “M³ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design,” Advances in Neural Information Processing Systems, vol. 35, pp. 28 441–28 457, 2022

  68. [68]

    Speechmoe: Scaling to large acoustic models with dynamic routing mixture of experts,

    Z. You, S. Feng, D. Su, and D. Yu, “Speechmoe: Scaling to large acoustic models with dynamic routing mixture of experts,” in Interspeech 2021, 2021, pp. 2077–2081

  69. [69]

    Speechmoe2: Mixture-of-experts model with improved routing,

    Z. You, S. Feng, D. Su, and D. Yu, “Speechmoe2: Mixture-of-experts model with improved routing,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7217–7221

  70. [70]

    Language-routing mixture of experts for multilingual and code-switching speech recognition,

    W. Wang, G. Ma, Y. Li, and B. Du, “Language-routing mixture of experts for multilingual and code-switching speech recognition,” in Interspeech 2023, 2023, pp. 1389–1393

  71. [71]

    Mixture-of-expert conformer for streaming multilingual asr,

    K. Hu, B. Li, T. Sainath, Y. Zhang, and F. Beaufays, “Mixture-of-expert conformer for streaming multilingual asr,” in Interspeech 2023, 2023, pp. 3327–3331

  72. [72]

    Attentive statistics pooling for deep speaker embedding,

    K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Interspeech 2018, 2018, pp. 2252–2256

  73. [73]

    Boosting objective scores of a speech enhancement model by metricgan post-processing,

    S.-W. Fu, C.-F. Liao, T.-A. Hsieh, K.-H. Hung, S.-S. Wang, C. Yu, H.-C. Kuo, R. E. Zezario, Y.-J. Li, S.-Y. Chuang et al., “Boosting objective scores of a speech enhancement model by metricgan post-processing,” in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2020, pp. 455–459

  74. [74]

    Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition,

    F. Tao and C. Busso, “Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 7, pp. 1290–1302, 2018

  75. [75]

    Freesound datasets: a platform for the creation of open audio datasets,

    E. Fonseca, J. Pons Puig, X. Favory, F. Font Corbera, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, “Freesound datasets: a platform for the creation of open audio datasets,” in Hu X, Cunningham SJ, Turnbull D, Duan Z, editors. Proceedings of the 18th ISMIR Conference; 2017 Oct 23-27; Suzhou, China. [Canada]: International Society for Music In...

  76. [76]

    Icassp 2023 deep noise suppression challenge,

    H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, M. Golestaneh et al., “Icassp 2023 deep noise suppression challenge,” IEEE Open Journal of Signal Processing, vol. 5, pp. 725–737, 2024

  77. [77]

    Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,

    C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, pp. 146–152

  78. [78]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017

  79. [79]

    Mod-squad: Designing mixtures of experts as modular multi-task learners,

    Z. Chen, Y. Shen, M. Ding, Z. Chen, H. Zhao, E. G. Learned-Miller, and C. Gan, “Mod-squad: Designing mixtures of experts as modular multi-task learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 828–11 837